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Abstract. We define a general class of network formation models, Statistical Expo- 
nential Random Graph Models (SERGMs), that nest standard exponential random 
graph models (ERGMs) as a special case. We analyze conditions for practical and 
consistent estimation of the network formation parameters. This addresses two holes 
in the estimation of exponential random graph models. First, although it is known 
that due to the enormity of the space of possible networks Markov chain Monte Carlo 
methods of estimation face slow (exponential) mixing times for many specifications, 
we provide a first set of results identifying nontrivial specifications for which practi- 
cal, accurate estimation is possible. Second, we provide consistency results showing 
when maximum likelihood and GMM (generalized method of moments) estimates of 
parameter converge to the true values in SERGMs. In particular, we show that a 
class of SERGMs that count subgraphs of various types are consistently estimated 
using direct and fast techniques. We also define a related class of network forma- 
tion models, SUGMs, and show that they are also consistently and easily estimated 
when networks are sufficiently sparse. We illustrate the application of the models and 
techniques with data. 
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I. Introduction 

...[A] pertinent form of statistical treatment would be one which deals 
with social configurations as wholes, and not with single series of facts, 
more or less artificially separated from the total picture. 
Jacob Levy Moreno and Helen Hall Jennings, 1938. 

Social networks shape our opinions and behaviors and so it is essential to have 
models for the estimation of network formation. Since an individual's decision to form 
a relationship with another individual can depend on whether they have friends in 
common, how many other relationships they are each involved in, and other aspects of 
their social positions, relationships are generally far from independent. This is seen in 
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many settings social and economic settings: from informal favor exchange where the 
presence of friends in common can facilitate robust favor exchange (e.g., Jackson et al. 
(2012)), to international trade agreements where the presence of one trade agreement 
can influence the formation of another (e.g., Furusawa and Konishi (2007)). Similarly, 
in forming a network of contacts in the context of a labor market, an individual benefits 
from relationships with others who are better-connected and hence relationships are 
not independently distributed (e.g., Calvo-Armengol (2004)); nor are they in a setting 
of risk-sharing (e.g., Bramoulle and Kranton (2007)). 

Once such interdependencies exist, estimation is no longer at the level of pairs of 
nodes, but must encompass more of the network, even the whole network, as reflected 
in the quote from Moreno and Jennings (1938) above. Given their ability to incorporate 
general dependencies, exponential random graph models (henceforth "ERGMs") have 
become the workhorse models for estimating network formation. 1 Indeed, as originally 
shown via a powerful theorem by Hammersley and Clifford (1971), the exponential form 
can nest any random graph model and can incorporate arbitrary interdependencies in 
connections. 2 Moreover, there are random utility foundations for these models, as we 
show and others have shown. 

Although such models are widely used and are seemingly natural tools for estimating 
the formation of networks with interdependent ties, there are two critical gaps in the 
understanding of these models of network formation and we address both of them in 
this paper. 

First, the number of possible networks on a given number of nodes is an exponential 
function of the number of nodes. Estimating the likelihood of a given network (as 
part of Bayesian, maximum likelihood, GMM, or other procedures) requires having 
some estimate of its likelihood relative to the other networks that could have appeared 
instead. Given the exponential explosion of the number of possible networks, 3 it is 
impossible to directly calculate likelihoods of a given network appearing and so some 
approximation is necessary. However, this computational hurdle is so formidable that 
even state-of-the-art algorithms for estimation can be inadequate and inaccurate. This 
is the subject of a burgeoning literature that shows that some MCMC estimation 
techniques cannot mix in less than exponential time for large classes of exponential 
random graph models (e.g., see Bhamidi et al. (2008) and Chatterjee et al. (2010)). 4 

1 These grew from work on what were known as Markov models (e.g., Frank and Strauss (1986)) or 
p* models (e.g., Wasserman and Pattison (1996)). An alternative approach is to simply work with 
regression models at the link (dyadic) level, but to allow for dependent error terms, as in the "MRQAP" 
approach (e.g., see Krackhardt (1988)). Although that approach can be quite clean econometrically, 
it is less well suited for identifying the incidence of particular network subgraph structures that may 
be implied by various social or economic theories. 

2 Their theorem applies to undirected and unweighted networks. See the discussion in Jackson (2008). 
Of course, the representation can become fairly complicated; but the point is that the ERGM model 
class is broadly encompassing. Furthermore, as we illustrate below, it can also be adapted to allow 
for multigraphs such that nodes can have multiple types of relationships. 

For example with only n = 30 nodes, the number of possible networks is 2 435 (which exceeds most 
estimates of the number of atoms in the universe) . 

4 Apart from estimation issues, there is another issue that concerns which formulations of exponential 
random graph models are distinguished from independent link models. For example, Chatterjee and 
Diaconis (2011) show that some classes of ERGMs are indistinguishable in a well-defined sense from 
independent-link models. Our formulations avoid such issues, and in particular our results on sparse 
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Second, there is little that is known about the consistency of estimators in such 
exponential random graph models. 5 It is important to emphasize that the relevant 
asymptotic frame is one in which the number of nodes grows to infinity, rather than 
the number of networks; as in many applications, the data consist of a single network. 6 
If links were all independent of each other, then a single network provides many obser- 
vations and link probabilities are estimable. However, once there are interdependencies 
between links, standard asymptotic techniques do not apply and that is why so little 
is known about the consistency of estimation techniques in this setting. This does not 
mean that consistency is precluded, (just as it is not in time series or spatial settings) 
as even though the probability must be specified at the network level there is still a 
lot of information that can be discerned from the observation of a single network. 7 
Nonetheless, it does mean that we must approach the asymptotics in a way that ac- 
counts for the potentially complex correlations and interdependencies that arise in link 
formation. 8 

To fix ideas and discuss our approach in more detail, let us detail the issues that 
ERGMs face. In an ERGM, the probability of a network of observing a particular 
network g with an associated vector of statistics S(g) (for example, density of links, 
number of cliques of given sizes, the average distance between nodes with various 
characteristics, counts of nodes with various degrees, etc.) is proportional to 



where 9 is a vector of model parameters. 9 Turning this into a probability requires 
normalizing this expression so that the probability of observing g is 



networks apply to classes of models that are well-distinguished from independent-link models and to 
which the Chatterjee and Diaconis (2011) results do not apply. 

Of course, consistency has been examined in the context of some random graph models (e.g., see 
Bickel et al. (2011); but not for the class of models that we consider here. Consistency of a different 
sort has been examined by Shalizi and Rinaldo (2012) in the context of ERGMs: whether or not if one 
observes only a subsample of the network, whether the estimated parameters will accurately reflect 
the true parameters of the full network. That is also related to work by ?, but we do not address that 
separate issue here. 

6 If the asymptotic frame was one in which we observed a growing number of networks with a fixed 
number of nodes, each generated by the same network formation process, consistency would be trivial. 
Although there exist such data sets, they are the exception rather than the rule. And, even for such 
data sets, the estimation issues that we tackle here are relevant. 

7 In cases where a time series of link formation is observed or postulated and estimated, then one can 
take advantage of the sequentially to see how each link forms conditional on the network in place 
at the time (e.g., see Christakis et al. (2010)). However, without such information, or in cases where 
links may further evolve over time, the network perspective again prevails. For example, Mele (2011) 
considers a sequential model and then shows that it becomes effectively equivalent to a certain ERGM. 
8 Clearly, there are other settings with interdependencies, such as time series and spatial settings. 
The network setting presents a complex set of interdependencies that do not permit off-the-shelf 
approaches. 

9 The reasons for using the exponential family is clear. It nests many standard distributions such as 
multivariate normal, Poisson, power, lognormal, exponential, gamma, beta, Weibull, Laplace, multi- 
nomial, etc. Also, it has many nice properties regarding its cumulants and moment generators, and 
has been well-studied in the statistics literature. 



exp(9-S(g)) 





exp (9 ■ S (g)) 
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The challenge is that estimating the parameters of such a model, by maximum likeli- 
hood, Bayesian, GMM, or other methods, requires estimating how the relative likeli- 
hood of a network relates to parameters. This involves either explicitly or implicitly 
estimating the denominator of (1.1). However, as we mentioned above, examining all 
possible networks g' is infeasible even for small n, and thus, one has to take other 
routes. 

The adaptation of Markov chain Monte Carlo (MCMC) sampling techniques to sam- 
ple networks and estimate ERGMs, by Snijders (2002) and Handcock (2003), provided 
a breakthrough and the subsequent development of computer programs based on those 
techniques led to their widespread use. 10 However, it was clear to the developers and 
practitioners that at least some specifications of ERGMs had convergence problems. 
Until recently, it remained unknown if or when these techniques would mix accurately 
in a feasible time. Given the enormity of the space any MCMC procedure would visit 
only an infinitesimal portion of the space, and it was unclear whether that was suffi- 
cient to gain an accurate picture. Unfortunately, important recent papers have shown 
that for broad classes of ERGMs some of these MCMC procedures will take exponen- 
tial time to mix unless the links in the network are approximately independent (e.g., 
see the discussions in Bhamidi et al. (2008) and Chatterjee et al. (2010)), in which 
case the ERGM was not needed as the network could then be viewed as a collection 
of independent edges and edge formation could be estimated directly. Such difficulties 
were well-known in practice to users of software that performs such estimations, as 
rerunning even simple models can lead to very different parameter and standard error 
estimates,but now are known to be more than an anomaly. 

Beyond the computational challenges, it is also not known whether various estimators 
of ERGMs are consistent. For example, when they are computable, will maximum 
likelihood estimates converge to the true parameters as the number of nodes becomes 
large? Given that data in many settings consist of a single network or a handful of 
networks, we are interested in asymptotics where the number of nodes in a graph grows. 
However, it may be the case that increasing the number of nodes does not increase 
the information in the system. In fact, for some sequences of network statistics and 
parameters it is obvious that the parameters of the associated ERGM are not consistent. 
For example, suppose that S(g) includes a count of the number of components in 
the network and the parameters are such that the network consists of a single or a 
few components. The limited number of components would not permit consistent 
estimate of the generative model. Thus, there are models where consistent estimation 
is precluded. On the other extreme where links are all independent, we know that 
consistent estimation would hold, and so the interesting question is for which models 
is it that consistent estimation can be obtained. 

We do four things in this paper. 

First, we propose a generalization of the class of ERGMs that we call SERGMs: 
Statistical ERGMs. To understand the generalization, it helps to note that ERGMs 
define a probability distribution on the space of sufficient statistics S(g) (e.g., frequen- 
cies of edges, cliques, stars, etc.) and conditional on having these characteristics, any 



'See Snijders et al. (2006) for more discussion. 
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network is equally likely. 11 Thus, ERGMs the probability of forming a network is de- 
termined by its statistics; for instance, having a given link density, a given clustering 
coefficient, specific path lengths, etc., and then any network exhibiting the same prop- 
erties measured by the model is equally likely. 12 Moreover, the weighting (or reference 
distribution) used over the space of statistics is then a special one: the probability of 
seeing specific statistics depends on the number of networks that exhibit that statistic. 
It is also difficult to compute, 13 and it is not clear that it should be preferred by the 
modeler. SERGMs nest the usual ERGM models by noting that: (i) we can define 
the model directly over the statistics thus greatly reducing the dimensionality of the 
space, and (ii) we can weight the distribution over the space of statistics in many ways 
other than simply by how many networks exhibit the statistics. Some of these refer- 
ence distributions generate natural models that both allow for realistic features as well 
as desirable statistical properties such as consistency and asymptotic normality of the 
estimators of model parameters. 

Second, we examine sufficient conditions as well as some necessary conditions for 
consistent estimation of SERGM parameters (nesting ERGMs as a special case) and 
identify a class of SERGMs for which it is both easy to check consistency and estimate 
parameters. Models in this class are based on "count" statistics (for instance, how many 
links between nodes with certain characteristics exist, how many triangles 14 including 
certain types of nodes exist, how many nodes have a given degree, etc.). By slightly 
altering the reference distribution, we show that a natural family of models exhibit 
desirable statistical properties. 

Third, we identify another class of network formation models that are based on the 
formation of subgraphs. We can think of such a network as being constructed from 
building blocks of varying sizes: links, triangles, larger cliques, stars, etc., layered 
upon each other. We show that if such models are sufficiently sparse in a well-defined 
way, then they are consistently estimable and parameters are asymptotically normally 
distributed. Such sparse networks appear in many applications as they have realistic 
features (e.g., average degree that grows at a rate less than n, but still allowing for high 
triadic closure, homophily, rich degree distributions, and so forth). These models are 
also easily to simulate from and admit random utility foundations that may be used 
in economic applications (e.g., counterfactual policy analysis, testing theory). 

Fourth, we provide an illustration of the techniques developed here by applying 
them to data on social networks from Indian villages. In particular, we also develop an 
extension of the models that apply to multigraphs (so individuals may have different 
sorts of edges between them). This allows us to test several theories of multiplexing: 
fixed costs of link formation, patterns that foster favor exchange, and correlated (un)- 
observables. We find evidence consistent with predictions of a theory of favor exchange: 
being a member of a triangle may substitute for having multiple types of relationships 
with others. 

11 This is related to the well-known property of sufficient statistics of the exponential family. 

12 As an analogy, a binomial distribution defines the probability of seeing x heads but does not care 

about the exact sequence under which the x heads arrive. 

13 Just as an example, calculating the number of graphs on n nodes with a given number t of triangles 
is a challenging combinatorial problem. 

14 Triangles refer to triads: cliques of size three; that is, triplets of nodes k that include all three 
possible links. 
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In terms of related literature, our second class of models (SUGMs) is not the first 
to build networks up from randomly generated subgraphs, as that is done by Bollobas 
et al. (2011). Nonetheless, the results, models, and foci differ significantly. We are 
interested in building broad classes of models that have practical and simple estimators, 
and showing that those estimators are consistent. This involves a special technique 
in ordering the statistics in the way that they are counted, so, for instance, a link 
generated in isolation is counted differently from one that appears embedded in a clique. 
Bollobas et al. (2011) is more concerned with properties of the resulting networks such 
as thresholds for giant components, degree distributions, and on clustering properties, 
and thus looks at models with particular rates of subgraph formation. 

The remainder of the paper is organized as follows. Section 2 introduces the frame- 
work, ERGMs, and discusses random utility foundations that are useful in economic 
applications. In Section 3 we introduce SERGMs. Consistent estimation of SERGMs 
is discussed in Section 4, and we also relate this to a weighted binomial distribution. 
In Section 5, we introduce another model of network formation based on subgraphs, 
which we show to be easily and directly estimable in the case of sparse networks, with 
consistent and asymptotically normally distributed estimators. We present simulation 
exercises as well as an empirical application to Indian village networks in Section 6, to 
show how well these models can mimic real-world network data. Section 7 describes ex- 
tensions and applications of these models, including how they may be used in analyzing 
strategic network formation as well as multigraphs. We then provide an application to 
Indian village network data to study multiplexing between favor-exchange and other 
types of social links. Section 8 concludes. 

2. Preliminaries and ERGMs 
We begin with background definitions and notation. 

2.1. Networks or Graphs. Let Q n be a set of possible graphs on a finite number n 
of nodes. The class can consist of undirected or directed graphs and unweighted or 
weighted graphs. We take Q n to be finite, but it will generally be large as a function of 
n. For instance, if Q n is the set of all undirected, unweighted graphs on n nodes, then 

the cardinality of Q n is \Q n \ = 2 GO. 

We often omit notation Q n and denote a generic network by g. Thus, if we write J2 g 
it is understood to mean Yl g eG n f° r whatever class of networks is relevant. 

Unless otherwise stated we take Q n to be the set of undirected, unweighted graphs; 
but as will be clear, the results extend readily to more general classes, including multi- 
graphs as we illustrate below. 

2.2. Network Statistics. Let S(g) = (Si(g), . . . ,Sk(g)) be a vector of statistics con- 
cerning a graph g e Q n , where Si : Q n — > R. 

Given a statistic S, let 

N S (s) :=\{geg n :S(g) = s}\ 

denote the number of graphs that result in a given statistic. 

Statistics can be many different things, including standard network characteristics: 
the number of links, the number of cliques of some given size, the number of links that 
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have neighbors in common, homophily measures, numbers of nodes with given degrees, 
average distances, and so forth. 

It is important to note that node characteristics can also be included in statistics. 
For example, nodes might be classified into some finite number of groups based on 
some characteristics, and then one can track the number of links between various types 
of nodes, various clustering and cohesion measures by types of nodes, and so forth. 
This permits the fitting of choice-based models, where the utility that an individual 
derives from a link to another depends on node characteristics and network position. 
See Section 7.1 for additional discussion. 

2.3. Standard ERGMs. A standard ERGM model takes the form 

, 01 n j3 / \ exp(0 -3{g)) exp(9-S(g)) f i fa \\\ 

(2 ' 1} Fe {9) = E^exptg-Sfr)) = m = 6XP {6 - S{9) - l0gim)) ' 

where = (0\, Of.) G M fc is a vector of real- valued parameters, and 

9' 

is the appropriate normalizing coefficient. 



2.4. Maximum Likelihood Estimation. Although our analysis has clear extensions 
to Bayesian estimation, we proceed mainly in the context of maximum likelihood esti- 
mation and of generalized method of moments. 

Given a model of the form (2.1) and an observed network g, the maximum likelihood 
estimator, denoted 0(g), is the that maximizes Pe (g). Equivalently, it is the that 
maximizes ■ S (g) — \og(ip(0)) and therefore satisfies 

s( 9 ) = vm/m, 

and, by standard properties of the exponential family (see (2.2)), this can be written 
as 

(2.2) S(g) = EjW)]. 

We discuss estimation in more detail below. 



2.5. Asymptotics. As mentioned in the introduction, although there are exceptions, 
many applications of interest involve observation of one (large) network. Given the 
interdependencies in links, this presents a challenge in estimation as there are not re- 
peated independent observations. Thus, the challenge is understanding when a large 
enough network contains enough information to accurately estimate parameters. Corre- 
spondingly, our asymptotic frame is one in which we observe a single network and want 
to recover parameters consistently as the number of nodes goes to infinity. Therefore 
we discuss sequences of ERGMs, a sequence of distributions over Q n , so that practical 
and consistent estimation of n is possible for large enough n. 
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2.6. Random Utility Rationalizations for ERGMs. Given the extensive litera- 
ture on strategic network formation, and the many applications where this is at least 
partly at play, it is important to be able to estimate strategic network formation mod- 
els. 

There are several ways in which ERGMs capture aspects of network formation that 
go beyond direct link values. One is that agents might value friends of friends, and 
thus wish to connect with others who have greater numbers of connections. As shown 
by Mele (2011) this leads to an ERGM specification that includes "two stars," or links 
Another is that agents might value bridging separate groups. Another is that agents 
might value having a relationship within a clique differently from having a relationship 
with someone with whom they do not share any friends. 15 

In Section 7.1 we detail specifications of SERGMs designed for strategic network 
formation. 



The expression for the normalizing coefficient of an ERGM, ip(6), is impossible to 
calculate directly with more than a few nodes, as the number of graphs on n nodes 
grows exponentially in n 2 . This presents obvious problems for Bayesian, maximum 
likelihood estimation, and methods of moments estimations. 

Our approach is based on re-specifying the model (to a more general class) that are 
based directly on the (sufficient) statistics S(g) and then also possibly renormalizing 
the weights that this procedure generates. 

3.1. An Example and Illustration of Computational Advantages. 

We begin with an example that illustrates some of the ideas of our approach. 

In many settings, the number of isolated nodes is significantly higher than one might 
observe at random in an Erdos-Renyi random graph. In fact, Moreno and Jennings 
(1938) estimated that there were more than ten million socially isolated individuals 
in the U.S. in the 1930's, many more times than would be expected if relationships 
were formed at random. This would be true in many settings in which individuals can 
decide not to be active and not to form any relationships, and is quite different from a 
setting where all nodes randomly form links and some just happen not to form any. 

Let us thus consider a simplest model that allows nodes to be isolates, separately 
from happening not to form links. 

In particular, let Si(g) denote the number of links in a network g and Si(g) denote 
the number of isolated nodes (with no links) under g. We can define a random network 
model where some nodes do not participate and then other nodes form links uniformly 
at random as in an Erdos-Renyi random graph. 

The associated ERGM is given by 



Valuing friends of friends relates to access to information and favors, as in the connections model 
of Jackson and Wolinsky (1996). Values to bridging appear in theories such as those underlying 
structural holes of Burt (1992). Values to cliques arise under various theories such as those of closure 
Coleman (1988) or favor exchange Jackson ct al. (2012). 



3. SERGMs and Estimation Techniques 



Pe(g) 



eMOLSM + erSrig)) 



expiOLSM + e&ig)) 
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If 61 = then this reduces to a standard Erdos-Renyi random network on all nodes. 
The more interesting case is where 9j 7^ 0. In that case, networks become more {9j > 0) 
or less (9j < 0) likely based on the number of isolated nodes. For instance, if 9j > 
then links would tend to be concentrated among a smaller number of nodes than under 
an Erdos-Renyi random graph, with a larger number of nodes with no links. 

Many socially generated networks have excessive number of isolated nodes, and this 
is a simple model that generates networks with such effects. 

The difficulty with estimating such a model is that the number of such networks in 

the calculation of iP(9l,9j) is 2v 2 ). With just 100 nodes this is 2 4950 , while estimates 
of the number of atoms in the universe is less than 2 258 (Schutz, 2003). 

Methods of estimating the parameters of an ERGM are thus forced to circumvent 
direct calculation of '4>{9i n 9j), and instead use approximation methods such as MCMC 
(Markov Chain Monte Carlo) techniques. 16 Even with this approach, the fraction of 
all networks that can be sampled is tiny, and more pressingly, the space is difficult to 
traverse in a representative fashion. For instance, if one samples say 10000 networks, 
then one samples on the order of 2 16 networks out of the possible 2 4950 on 100 nodes. 
Thus, unless one is very lucky or very knowledgeable in choosing which networks to 
sample and how many to sample of different types, the sample is unlikely to be even 
remotely representative of the possible configurations that might occur. Indeed the 
time before which an MCMC technique has a chance to sample enough networks to 
gain a representative sample can be exponential in n 2 and so is prohibitively large even 
with a small number of nodes. 17 ,1S 

The alternative approach that we explore here takes advantage of the following 
insight. 

Let Ns(sl,si) denote the number of networks that have exactly sl links and si 
isolated nodes. An approximation of Ns(sl, si) is 



K s (s L ,si) 



OCT') 



otherwise. 

The first expression (™^J is the number of ways in which one can choose sj isolates out 



n — s 



of n nodes, and the second expression ( v 2 J ) is then the number of ways in which one 



can have si links among the remaining nodes. This is not a precise count, since ( v s » 
allows for some isolated nodes among the nodes that are supposed to be connected. 
However, if sl ^> (n — Sj) log(n — Si), then for large n the number of networks among 



16 See Snijders (2002), Handcock (2003), as well as discussions in Snijdcrs ct al. (2006) and Jackson 
(2011). The rough intuition is that such methods sample some networks to get an impression of the 
relative likelihoods in terms of relative sizes of exp(6>LSi,((/) + 9iSi(g')) and thus develop a rough 
estimate of the relative likelihood of the observed data under various specifications of 9. 
17 This does not even include difficulties of sampling. For example, as discussed by Snijders et al. 
(2006), a technique of randomly changing links based on conditional probabilities of links existing for 
given parameters can get stuck at complete, empty, or other extreme networks. 

18 Bhamidi et al. (2008) show that, in estimating many classes of ERGMs, MCMC techniques using 
Glauber or Metropolis-Hasting 's dynamics (or any local Markov dynamic) mix in less than exponential 
time only if any finite group of edges are asymptotically independent. 
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the 2 that end up having isolated nodes goes to 0. 19 Thus, in situations where 



expected degree exceeds the threshold of log(ra — Si), K s (sl, si) becomes an accurate 
approximation to N$. In fact, we further simplify by approximating K by using 



Ks(sl,si) 



(.")( ( T>) ^<(V).and 
otherwise, 



where the only difference from K is not imposing the restriction that sl > 1 ^~y l - 

This is quite useful, since in settings where expected degree is above the threshold 
of log(n), we approximate 

il>(0 L ,9i)Kt Yl Kg(s L , Sl )exp(e L 8 L + e lSl ). 

(«Lj*l) 

Here the summation involves less than n(n(n — l)/2) entries, whereas the direct 
calculation involved 2 n ( n ~ 1 >' 2 calculations. So, for instance, if n = 100 then this is 
fewer than 495000 calculations, as opposed to 2 4950 . This makes the ERGM practically 
estimable, with a direct calculation of the likelihood. 

We can then further simplify this expression to reduce the calculations further. The 
sum on the right hand side of (3.1) is concentrated in a shrinking neighborhood around 
the expected (and, with probability going to one, realized) values of Sj, Sl, so that one 
needs only sum in a shrinking neighborhood of the realized values of Si, Sl (as we 
discuss in more detail below). 

3.1.1. An Example of the Problems with MCMC Estimation of ERGMs. To illustrate 
this we simulate this model with n = 30. Of these nodes, 20 nodes form an Erdos-Renyi 
subgraph, with edges forming with i.i.d. probability of roughly p = 1/3 (the realized 
matrix had .34). 20 Meanwhile, the remaining 10 nodes are isolated. We fit a simple 
model using a standard ERGM estimation software (statnet via R) and estimate the 
following over 25 different runs on the same fixed network: 

Figure 1 illustrates the potential instability associated with ERGM estimation using 
MCMC techniques. Even with only thirty nodes and a very simple two-statistic model, 
the algorithm for estimating the ERGM comes to very different estimates when rerun 
(and each MCMC run involves thousands of iterations). Even if an estimate provides 
for nontrivial standard errors on the estimates, a reliable algorithm should reproduce 
nearly the same estimates off of identical data when replicated. Given that this is a 
random sampling algorithm, some noise might be expected, but not of this magnitude. 21 
Table 1 provides more details on the distribution of parameter estimates. 



19 This is a corollary to the fact that the threshold for connectivity in the "G(n, M)" class of random- 
graph models that Erdos-Renyi considered (the model where each possible network on n nodes having 
M links is given equal probability) is log(n) (see ?). This implies that the fraction of networks with 
M = (' 2 l )]3 links that have any isolated nodes goes to above the threshold where p = log(n)/n. 
20 This yields 190 observations of edge formation and 65 links were present and all nodes were con- 
nected. 

21 In fact, the parameters are well outside of the average standard errors that the program provides by 
a bootstrapping method, which are on the order of 0.11 for 9l and 0.2 for dj. The standard errors in 
the program presume that the sample of draws in its estimation provide a good approximation for the 
population, and so the bootstrapping can be very inaccurate given that even thousands of iterations 
provides a completely inaccurate sample of the true population of networks and coefficients. 
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Figure 1. Histograms of estimated ERGM parameters by statnet 
across different runs of the same program on the same network. The 
left-hand figure is the histogram of estimated parameters on links, while 
the right-hand figure is the histogram of estimated parameters on iso- 
lates. The red arrow indicates the true parameters, which can also found 
directly using our estimation techniques as discussed below. 
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0i 


Mean 


-0.647 


9.52 


S.D. 


0.1947 


18.456 


Median 


-0.5992 


15.8407 


Min 


-1.2564 


-23.3957 


Max 


-0.4401 


59.213 



Table 1 



The estimation with our SERGM technique leads to 6l = —.65 and 6j = —.68 
(without any iterations as the likelihood is now a direct calculation). 

It is worth noting that in the given model there is an even more direct and obvi- 
ous estimation method. One can simply estimate pi and pi as the probabilities of 
being isolated and then links (conditional on not being chosen as an isolated node) 
by frequency counts. It follows that the associated 9 are such that 6j = log j^ 2 — for 

j G {/, L}. However, we have introduced the machinery of the if(s)-functions to pre- 
view a more general form of ERGMs that the researcher can study. In certain (sparse 
enough) cases, as discussed in Section 5, the even more direct estimators provide very 
fast and accurate estimates of this and many other specifications. 

Before proceeding, we note that we used a model of isolates and links for illustrations, 
but there is nothing peculiar or special about this model in terms of the difficulties it 
poses for ERGMs nor for the improvements that SERGMs can offer in estimation. 
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3.2. ERGMs in a Statistical Representation. 

As we saw from the previous example, it will generally be useful to express ERGMs in 
an alternative representation, taking advantage of the sufficient statistics of the realized 
networks. Instead of considering the probability of observing a particular network, we 
instead ask what the probability is of observing a particular realization of network 
statistics: for instance, what is the probability of observing a network with a given link 
density, clustering coefficient, number of cliques of some size, homophily index, and so 
forth. It is in the properties of the network that our theories lie. 

More specifically, consider the probability of observing S(g) = s: 

p f of \ \ V- ex P(A - g (g)) 

and since ip{6) = J2 S ' Ns(s') exp (9 ■ s') it follows that 

iV 5 (s)exp (6 ■ s) 



Pe (S(g) = s) 



Generally, the range A of S is of much lower dimension than the set of all networks 
on n nodes. Thus, to the extent that one can compute/approximate N s (s), as we saw 
in the example given in Section 3.1 this can be a much easier formulation to work it. 
In fact, in many cases it makes sense to free up the N$ weights to a more general class, 
which can further simplify estimation and consistency. 

3.3. Statistical ERGMs (SERGMs). 

Note that (3.2) defines a model over the range A of potential network statistics of 
interest and, in principle, there is nothing special about the weighting Ns(-) function 
that is used over A. This leads us to our more general representation. 

Consider a vector of network statistics S = (Si, . . . , Sk) with potential values that 
lie in A C M fc . 22 Any weighting function K$(') defined on A, together with a set of 
parameters G define a statistical exponential random graph model, or SERGM 
for short. The associated probability of seeing realized statistics S = s is: 

f o-,\ p rc_ \ _ K s( s ) ex P (fi • s ) _ K s(s) exp Qg ■ s) 

(3.1) ^, Ks (b-s)- ^ Ks(sf) ^ ^ g/) - m 

where the normalizing coefficient is 

(3.2) m= E^(s') exp 

This is a model that states that the probability that a network exhibits a specific 
realization of statistics S = s is given by an exponential function of the statistics 
s. Thus, the model is specified specifically on the properties of the network rather 
than the actual realized network. Given the realized statistics s, the network g such 
that S(g) = s can be thought of as being drawn uniformly at random, but in fact 
that is inconsequential since the model is really about network properties rather than 
particular realized networks. 

In the language of exponential families of random variables, Ks(-) is simply a refer- 
ence distribution. Varying the reference distribution, of course, changes the resulting 



22 Given the finite number of possible networks, A is taken to be finite. If one wishes to work with 
weighted networks, then obvious extensions to continuous ranges and integrals apply. 
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odds of various values of s being drawn and can affect whether the model is consis- 
tently estimable. The special case in which K$ = N$ is corresponds to an ERGM in a 
statistical representation. 
Several remarks are in order. 

First, note that there is no reason to maintain that ifs(-)'s must approximate Ns(-)'s. 
Nature may choose properties of networks (5"s) according to some alternative weight- 
ings. The instance of studying iVs(-)-weighted SERGMs may be a historical one: on 
another planet, people may have first modeled SERGMs and would see those as natural 
with the ERGMs being a special case. 

Second, it can still be that one is interested in a specific sub-class of these models 
wherein the Ks(-)'s are chosen to approximate, locally, the behavior of AT 5 (-)'s, as we 
did in our isolates example. And, as we shall see, viewing things from the statistical 
perspective can help substantially with respect to estimation. 

Third, one way to interpret this model is as follows. A distribution on properties 
S G A drives the network formation process, and those network properties are what 
the researcher ultimately cares about with respect to the social or economic theories 
being studied. Which network forms given the realized statistics is secondary and could 
be uniform at random, or according to some other conditional distribution, so long as 
given the realized s a network g such that S(g) = s is drawn. 23 



3.4. Estimation. 

Let us now examine maximum likelihood estimation for the general class of SERGMs, 
which nests the maximum likelihood estimation of traditional ERGMs that we saw in 
(2-2). 

The maximum likelihood estimator /3(s) of a model of the form (3.1) is the maximizer 
of (s), or equivalently: 

(3(s) = argmaxlog (P^ (s)) = argmax/3 • s — log(0(/3)) + log(Ks(s)). 

f3 (3 

Thus, except for extreme cases, it satisfies 

s = V<f>0(s))/<t>0(s)), 
and so s = V«s(£)|t=o = E-g> AS], where k is the cumulant. Therefore, 
,oox gg^MgQexp (P-s')s' 



23 Note that any SERGM viewed as a distribution over statistics defined with weights K$ and pa- 
rameters /3 also generates an ERGM, as it generates a distribution over networks, for example taking 
networks to be drawn uniformly at random from those with the given statistics. That ERGM can be 
written as 

p f s §gexp(/3' s (. 9 )) exp08'*GO +/„(*)) 



E 3 ' exp (P>s(g)) E„ exp (P's(g) + f n (s(g'))) 

where /„(s) = log ( j^f) J Therefore the model is modified by a shift with weights log (Ks(s)/Ns(s)). 
The sum here is only over g' for which there is some realizable statistic value so that Ns(s(g')) > 0. 
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So, in a case where the SERGM is sufficiently identified, so that ft ^ ft' implies that 



provided that this is well-defined. For extreme values of s this will not be well-defined. 
Here, we implicitly assume that the model is specified so that the probability of ob- 
serving extreme statistics for which this is not satisfied is negligible, which will be true 
of the asymptotic specifications that we work with provided that the ft's do not tend 
to extremes too quickly. 

Of course, this also suggests that we may be interested in a GMM estimator associ- 
ated with the first order condition equations normalized by the rate of growth of the 
associated network statistics S, or some other normalization. For example, we may 
have a diagonal matrix C with positive diagonal entries such that we are interested in 
the estimator 



The maximum likelihood estimator discussed above, s = E^ [S], then is the case where 
the min is achieved with a value of 0, whereas the above specification allows for the 
more general case. 



The next issue that we address is the consistency of estimation of SERGMs: Under 
what conditions does an estimator of a SERGM converge to the correct estimate in 
probability? To our knowledge there are no results on consistency of ERGM estimators. 
The primary challenge is that the data consists of a single network and the asymptotics 
are in terms of the number of nodes. Moreover, the interest in SERGMs is in the the 
fact that relationships are correlated, and so the data can be far from independent. 

As with any estimation problem, for an estimator to be consistent, two things need 
to balance. First, the data have to have sufficient variation to be informative, i.e., 
the (norm of the) covariance of the data has to grow. That is, more information 
must be obtained from each additional unit of data (the change in statistics associated 
with each additional node). Second, the variation cannot grow too fast - that is, the 
rate of growth has to be such that, when appropriately normalized, the statistics will 
concentrate around their expected values so that the data induce the correct estimate 
of parameters. Both of these are usually guaranteed by having growing numbers of 
independent, or at least asymptotically independent (satisfying a mixing property) 
observations, so that new information comes in and a law of large numbers applies 
so that some normalized statistic concentrates around its mean. The challenge here 

24 For example, for a simple Erdos-Renyi random network where the count statistic is simply the 
number of links in the network, then if turns out that all links are present so that s = n(n— l)/2, then 

the (5 that maximizes the likelihood of the ERGM formulation is essentially infinite (the (3 = log ( j^— J 



corresponding to the maximum likelihood estimator of the link probability (p = 1) is not well-defined). 
For more on the nonexistence of well-defined maximum likelihood estimates for extreme networks see 
Rinaldo ct al. (2011) . 




(3.5) 




4. Consistency 
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is that we cannot rely on approximate independence, and so must work directly with 
these conditions. 

In our context, having sufficient variation rules out certain sorts of statistics. For 
instance, recall the example mentioned in the introduction such that the probability of 
a network depends on the number of components in the network. If the parameters are 
such that with high probability the network generated has only a few components, then 
there will be insufficient information in the data to accurately estimate the parameters 
of the model. We can thus see that statistics will typically have to take on a wide array 
of values in order to provide for accurate estimation, but that alone will be insufficient. 
The concentration about the mean will also be required. This requires that there not 
be too much interdependence between all the relationships in the network, so that 
asymptotically there is sufficient information for parameter estimation. 25 

We proceed as follows. First, we provide propositions that generally characterize 
consistency of SERGM estimation. Next, we then look at some specific SERGMs to 
show that they satisfy those conditions. Third, we also show that these classes of 
SERGMs can be easily (quickly) and accurately estimated. 

4.1. Consistency in Our Context. 

We begin with a general characterization of consistency of SERGMs, and then show 
how these conditions apply in specific contexts. 

Consider a sequence of SERGMs (S n , K$n, A n , (3 n ) as defined in (3.1). 

A sequence of SERGMs is expectations-identified with respect to a sequence of di- 
agonal matrices C n > with positive diagonal entries if there exists 7 > such that 

\C n E /3 [S n )-C n E /3 n[S n }\> 1 \/3-(3 n \ 

for all n. 26 

Expectations identification is an intuitive condition that requires that different pa- 
rameters distinguish themselves with different means. It is a sort of minimal condition 
since if two different parameter values generate very similar expected statistics, then 
observing the realized statistic will not allow us to distinguish the parameters. It en- 
codes, as shown in the appendix, a sufficient condition for the application of a uniform 
law of large numbers; thus, provided pointwise convergence is satisfied, one can show 
that parameter of an extremum estimator are consistent. 

A sequence of SERGMs is concentrated with respect to a sequence of diagonal ma- 
trices C n > with positive diagonal entries if 

C n {S n - E^[S n }) -^0 for f3 n e B. 

The concentration condition requires that there is some normalization (C n ) for which 
the statistics will concentrate around their means. As we have not scaled statistics 



This can be satisfied if there is some geography to nodes and links between distant ones are essentially 
independent. This is related to the domain increasing asymptotics in the spatial literature, and also 
to random geometric graphs (RGG, see, e.g., ?), as well as a variety of latent space models. Without 
a natural embedding of nodes in some latent space, making assumptions of approximate independence 
or mixing conditions are stronger than we would like to use here. 

26 Here the subscript notation E^S 1 ™] indicates that the expectation takes the probability to be spec- 
ified by (3.1) with parameters (S n , K% n , A n , fi). 
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we have to allow for some renormalizations. Viewed as a moment problem, this is 
essentially just pointwise convergence. 

Note that the choice of C n in the following theorem links them across the two con- 
ditions. To guarantee concentration there has to exist a sequence of C n that goes to 
fast enough, while to guarantee expectations identification they cannot go to too 
quickly. So, the C n 's identify the rate at which the statistics approach their means. 
The key to verifying consistency is then seeing whether there exists such sequences for 
which both conditions hold simultaneously. 

Proposition 1. If a sequence of SERGMs (S n ,Kg n ,A n , (3 n ) is expectations-identified 
and concentrated with respect to some C n , then the random vectors are consistent so 
that 

\j3 n (S n ) -/3 n \ -Ao. 

The proof of Proposition 1 is relatively routine in following standard consistency the- 
orem approaches and appears in the appendix. 

It is also useful to state a ratio version of Proposition 1 to address cases where 
parameters are, for instance, close to 0. Thus, we wish to have a stronger notion 
of consistency, not only requiring that the estimator approaches the true parameter, 
but that it does so in terms of a ratio. This requires corresponding definitions of 
concentration and identification that are ratio based. We do this in Appendix A in 
Proposition B.l. 

The result is fairly tight in that a version of a converse holds as well. In particular, 
the following result holds. 28 

A sequence of SERGMs (S n , Kg„, A n , (3 n ), is rate-expectations-identified with rates 
given by a sequence of diagonal matrices C n > with positive diagonal entries if there 
exist jh > 1l > such that 

lH \f3 - (3 n \ > \C n Ep[S n ] - C n Ep n [S n )\ > lL \(3 - /3 n \ 

for all n. 

Rate-expectations-identification is a condition that says that the sequence of diagonal 
matrices C n accurately captures the rate at which the expected statistics E /3 [S' n ] nears 
E^nfg' 1 ]! as we let the vector of parameters (3 approach (3 n . 

Proposition 2. If a sequence of SERGMs (S n , K£ n , A n , f3 n ) is rate- expectations- 
identified with rates C n , then the random vectors are consistent (i.e., \(3 n (S n )— (3 n \ — > 0) 
if and only if the sequence is concentrated with respect to (3 n and the same sequence of 
C n . 

The above proposition(s) provide high-level conditions wherein the concentration of 
the sufficient statistics about their expectations, when normalized, delivers consistency. 
It is then important to know which models will satisfy the concentration requirement. 

27 Notice that the exponential term in the associated likelihood can be written exp (5"/3) = 
exp (S'CnC^ 1 (3) and we are interested in the associated parameters f3, not C" 1 /? which will typi- 
cally trail off to infinity at polynomial rates in n. 

28 We state a version for Proposition 1, and the analog for ratio-consistency of the type in Proposition 
B.l in the appendix is left to the reader. 
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In the case of iid, time series or even spatial mixing data, we would typically use a 
concentration inequality such as Chebyshev's inequality or a Chernoff bound to bound 
the estimate of the mean by some moments and mixing coefficients. 

In the appendix we also discuss a characterization of consistency in terms of variance 
of the statistics in greater detail. The argument is similar to those for standard esti- 
mators, e.g. Amemiya (1973): For consistency, we need enough variation so that the 
system accumulates information and concentrates around its mean. This corresponds 
to needing the norm of the variance matrix tending to infinity, just as we would in 
typical regression-like applications. 

4.2. A Class of Consistent and Easily Estimable SERGMs: Simple Count 
SERGMs. 

To put the above consistency results to work, we now define a natural class of 
SERGMs that have consistent estimation as well as very straightforward and practical 
techniques for estimation. 

Some ERGMs in which statistics involve counts of subgraphs, can be difficult to 
estimate, since the associated N s (-Ys can be difficult to calculate. We now discuss an 
alternative specification, with substituted A's's that circumvents these difficulties. 

Suppose that S n = (S*™, . . . , S%:) is an /c-dimensional vector of network statistics 
whose £-th entry takes on non-negative integer values with a maximum value S e — > oo. 
We call such a SERGM specified with K n (s) = Ue (J) a simple count SERGM. Let 

let C n = Diag j ^ be the associated normalizing matrix. 

In a simple count SERGM, each statistic can be thought of as counting some aspect 
of the network: the number of links between nodes of various types, various types of 
cliques, other subgraphs, the number of pairs of nodes at less than some distance from 
each other, etc. Note that the class of statistics in a simple count SERGM encompasses 
most cases of interest. The difference from a standard ERGM is in the specification of 
the weights K$ that follow a natural counting weight. 

Associated with any vector of count statistics S n on n nodes is a possible range of 
values. It could be that there are cross restrictions on these values. For example, if we 
count links S2 and isolates S'j, then S2 cannot exceed (™~ 2 5/ )' ^ n case 

A n = l(s L ,si) : sj G {0,l,...,n},s L E |o, 1, . . . , 

A sequence of simple count SERGMs (S n , K n , A n , (3 n ) have statistics that are not 
conflicted if there exists some e > such that 

II { L% m (1 - e)\ , . . . , [Epn (1 + e) J } C A n 
i 

for all large enough n. 29 

29 E^™ [S 1 ™] refers to the expectation taken with respect to the one dimensional distribution of Sf 
ignoring the other statistics: i.e., with respect to a SERGM ^ Se ^ P }f l ) , . . This is slightly 

^^<s" K se( s e> cx P(P' s i) 

different from the condition, since it takes expectations with respect to the unconstrained range of S" 
rather than cross restrictions imposed under A 71 . 
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Asking that statistics not be conflicted is a fairly simple condition. To see the issue, 
consider asking for a network on n nodes that has n isolates and a positive number 
of links is infeasible. Those realizations are conflicted: no network exists with that 
combination of statistics. The "not conflicted" condition simply asks that at least in 
some small neighborhood of the unconstrained expected values of the statistics - as 
if they were each counted completely on their own, they are not conflicted so that 
they are jointly feasible. So for example, if one expects one third of the nodes to be 
isolated, and one out of five links to be present, then with large numbers of nodes, 
these statistics are unlikely to be in conflict. This condition is quite easy to satisfy. 30 

Theorem 1. A sequence of simple count SERGMs that are not conflicted is consistent 



Furthermore, there is a direct estimator available for any model of this type. One can 
simply estimate parameters pi from a distribution Bin [S^,pe) with se as its realization. 



It is important to emphasize that this class of SERGMs still allows for strong in- 
terdependences and correlations in link appearances, both within and across statistics. 
What it takes advantage of is a local approximation of such SERGM distributions in 
unconflicted regions. 

Theorem 1 tells us that unconflicted simple count SERGMs form a consistently es- 
timable class whose statistical properties we understand very well. An advantage of 
such SERGMs is that they have easy-to-understand and important statistical proper- 
ties. More generally, this exercise points out that we could also ask about which sorts 
of other K$ reference distributions lead to consistency of the associated SERGM. We 
did this above with a very natural class, but one could work with others that embody 
other ways in which a formation process might lead to weights on statistics. 

5. A Class of Estimable Sparse Network Formation Models 

We now outline another new model of network formation that is particularly well- 
suited for estimation in the case of sparse networks. This model also takes its inspi- 
ration from the idea that subnetwork configurations are the center of the formation 
process. Here a network is directly formed through the formation of subnetwork con- 
figurations: various cliques, links, stars, etc., form, and the resulting network is their 
union. 



There are other things also embodied in the condition, as there are certain counts of statistics that 
might not be feasible: for instance it is not possible to have a network with only one triangle missing: 
once one triangle is removed it also removes many others from the network. Thus, the range of some 
statistics is not a connected (containing all adjacent entries) subset of the integers. Nonetheless, for 
lower values of triangles, this is not an issue. Generally, in the relatively sparse ranges of networks 
that are often of empirical interest, this condition can be easily satisfied. 



(i.e., \p n -p n \-^o). 



Moreover, the parameters are asymptotically normally distributed: 




Then/3, := log (p t /(l - p e )). 
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The reason for considering this class of models is that it is distinct from the simple 
count SERGMs, in that the network is directly formed from the subgraphs that are 
counted. Thus, the network is directly formed, rather than having the statistics, and 
then the network being ancillary. Both are valid sorts of models, and distinct enough 
that it is valuable to present both. 

Moreover, as we show, in sparse settings there are simple and accurate (consistent) 
estimation techniques for this class of other class of models based on binomial formulas. 

These models are widely applicable and estimable, since many social and economic 
networks of interest are sparse in the sense that the fraction of links that are present 
versus those absent is small. There are billions of people in the world and typically 
people have ongoing relationships with at most hundreds of others; there are tens of 
thousands of researchers in economics and typically authors co-author with handfuls 
of others; a typical village may have a thousand members each with approximately 
twenty friends; and so forth. 

5.1. A Subgraph Generation Model: SUGM. 

We now define this other model of network formation, that we sometimes refer to as 
a "SUGM" or a "Subgraph Model" or "Subgraph Generation Model". 

A subgraph statistic S™(g) is a count of how many instances of some particular 
(path-connected) subnetwork on nodes appear in g. 31 For example it could be a 
count of links, triangles, stars comprising m nodes ("m — 1 stars"), m-cliques. These 
could also be more intricate subgraphs, for example, groups of m nodes that contain 
at least one node with characteristics X, another node with characteristics X' , etc., 
and at least some number of links. These could also be directed subgraphs in the case 
of a directed network. 32 

Consider a vector of subgraph statistics S n = (S 1 ™, . . . , S%), and corresponding pa- 
rameters p n = (pi, . . . ,Pk). 

A network is randomly formed as follows. First, each of the possible subnetworks 
counted by is independently formed with a probability p\. Next, each of the possible 
subnetworks counted by S% * s independently formed with a probability p%. This process 
continues, through each of the different types of subnetworks counted under S n . This 
results in a network g. So, the process is one where various subgraphs are formed, 
including whatever cliques, stars, links, or other particular network structures that we 
think could be an important part of a network formation. 

So, as an example with triangles and links not in triangles, triangles are formed, and 
then links that are not already part of any triangle (including incidentally generated 
triangles) are formed. These could depend on node characteristics: there might be 
different probabilities for triangles forming based on which types of nodes are in the 
possible triangle, and similarly for links. 

This differs from the SERGM model from Section 4.2 because the actual statistics 
generated are not directly observed. In particular, there can be a bunch of incidentally 
generated subnetworks. So the actual counts of statistics under the resulting g can 
differ from the number that were formed directly under the process. Backing out how 

31 Formally, there is a set T-L = {Hi(X), H^X)} of representative graphs, possibly depending on 
covariates X, of mi nodes for £ — l,..,fc. Sf is given by the number of injective homomorphisms 
Hi <-t g. 

32 This definition does not admit isolated nodes, but those can also easily be counted. 
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many were actually formed directly is important in estimating the true parameters of 
the model, the p™'s. 

We show that such a model can be easily and consistently estimated if it is sparse 
enough, in a well-defined way. 

5.2. Incidentally Generated Subgraphs. 

To see why sparsity is important in estimation of SUGMs, consider the following 
challenge. Suppose that triangles are formed as part of a process of forming a net- 
work. In order to estimate the network formation process, we need to know how many 
triangles were formed. As an example, suppose that the triangles that were directly 
formed (the cliques that formed) were (1,2,3), (3,4,5), (1,4,6), and (7,8,9). There 
were really "four" triangles formed; however, we end up with an additional triangle: 
(1, 3, 4). Thus it is impossible to include those four triangles without ending up with a 
fifth triangle. This sort of problem means that some of the arrangements of 4 triangles 
result in some extras, and so are not really arrangements that have four triangles. This 
is a challenge in estimating a parameter related to triangle formation, since some of 
the triangles that we observe were truly generated in the formation process, and others 
were "incidentally generated". 

The key insight in terms of estimation, is that in cases where networks are sparse, 
then the fraction of incidentally generated subgraphs vanishes. 

In order to state our results on this class, we first need a few definitions. Consider 
a sequence of random graph models indexed by n, each depending on /c-dimensional 
statistics S n , where k is fixed for the sequence. 

For each I G {1, . . . , k}, let be the set of all subnetworks counted by a subgraph 
statistic S™. In particular, let g G G\ be listed as a set of links. 33 

Let us say that a vector of subgraph statistics S n = (5*™, . . . S^:) is nicely-ordered 
if the subnetwork counted by Sf cannot be a subnetwork of the subnetwork counted 
by S$ for k >£'>£> 1 and S$(g) counts subnetworks that are not subnetworks of 
subnetwork already counted by some £ < £': 

(i) gt G and gi< G G™, implies that gt (jL ge> 

(ii) S$(g) = \{ge G G%, : g e C g and g e <f_ g £ for any g t G G% such that g t C 
g for some £ < £'}\. 

Note that any set of subgraph statistics can be nicely ordered: simply order the sta- 
tistics so that £' > £ implies mg> < rri£. 

So, for example with triangles and links: first we count triangles and second we would 
count links that are not part of any triangles. The idea is that links are generated 
through the generation of triangles, and then also through direct generation. We then 
wish to estimate which things were really generated as triangles and which things were 
really directly generated as links that are not part of some triangle. On sparse enough 
graphs, the contamination between these two will be minimal as there will be few 
things that we mistakenly identify as triangles that were really incidentally generated 
by other triangles and links. Of course, links that are not part of any triangle could 
not have been incidentally generated. Thus, the nice ordering tells us that we should 
first count triangles and next count links that are not part of some triangle. If we were 



For example, if g is a triangle, then G" = {g : g — {ij,jk,ki}}, where we let ij indicate a link 
between nodes i and j and exclude self links so that i, j and k are distinct. 
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counting various cliques, we would first count the largest cliques, then the next largest 
that are not part of any largest clique, and so forth. 

The following definitions allow us to be precise about incidental generation and 
sparsity. 

Consider some subgraph g £ G G . We say that gi is incidentally generated by a set 
of subgraphs {g^}jej, indexed by J, where G G lj for each j for some £j > £, M if 
(i) gt C U jeJ g J , and 

(ii) there is no j' G J such that ge C Uj^jj^jigi 

Part (i) states that the subgraph is incidentally generated, and part (ii) of the condition 
ensures that the set of generating subgraphs is minimal. 35 

As an example, suppose that we are counting triangles and single links, so that S%(g) 
is the number of triangles that are a subset of g, and Si(g) is the number of links that 
are not part of any triangle that are a subset of g. The triangle {12,23,31} could 
be generated by the networks g 1 ,^ 2 ,^ 3 where g 1 = {12,24,41}, g 2 = {23,25,53} and 
2 3 = {31}. 




© 
© © © 

(a) n nodes (b) Triangle 124 forms 



(c) Triangle 235 forms (d) Link 13 forms (e) Triangle 123 is 

incidentally gener- 
ated 

Figure 2. This figure shows the emergence of an incidentally generated 
triangle. 

Figure 2 provides an illustration. 



It is possible that £j — iy for some j and j', so that some j's could correspond to different subgraphs 
of the same type - e.g., a triangle being incidentally generated by other triangles. 
35 0f course, a subgraph could be generated in multiple ways at once. The minimality condition 
identifies minimal classes that are needed for generation. 
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5.2.1. Generating Classes. 

In order to define sparsity, we have to keep track of the various ways in which a 
subnetwork counted by some statistic in S n = (S 1 ™, . . . , S%) could have been incidentally 
generated. 

There are typically many ways in which gi G G™ could be incidentally generated, 
but many of them are equivalent up to relabelings. For instance, there are many 
different combinations of triangles and edges that could incidentally generate a triangle 
gi = {12, 23, 31}, however there are only eight ways in which it be done if we ignore the 
labelings of the nodes outside of gg. link 12 could be generated either by a triangle or 
link, and same for link 23, and 31, leading to 2 3 = 8 ways in which this could happen. 

Consider some g% G Gg on a set of nodes NpdN that is incidentally generated by a 
set of subnetworks {gi}j<=j and also by another set {gi }/gJ'- We say that {gi}j^j and 
{g^ }j>ej> are equivalent generators of gi if for each gi there is gi such that £j = £y and 
9j H ge = 9j' H gg. So the generating sets play the same roles in gg but might involve 
different nodes outside of N(ge). Note that equivalent sets of generators must have 
the same cardinality as they must both be minimal and involve the same intersections 
with gg. 

Given this equivalence relation, there are equivalence classes of generating sets of 
networks for any gg. It is easy to verify that there are at most Mg < k me equivalence 
classes of (minimal) generating sets for any subnetwork gg. For each class in the 
equivalence classes we have some list (£j,kj)j e j of the types of subnetworks and the 
number of nodes that the given subnetwork has intersecting with gg. We call these the 
(minimal) generating classes of a network gg and note that these are the same for all 
members of G™, and so we refer to them as the generating classes of £. 

Figure 3 provides an illustration. 

5.2.2. Relative Sparsity. Given a set of nicely ordered subgraph statistics S n = (S™, . . . , S%) 
and any £ G {1, . . . , k} and any generating class (£j, kj)j e j of £. Let Mj = (Sjej %) ~~ 
mg: 36 For example, in forming a triangle from any combination of triangles and links, 
each kj = 2 and so Mj = 6 — 3 = 3. 

We say that a sequence of models as defined in Section 5.1 with associated nicely- 
ordered subgraph statistics S n = (S™, . . . , S%) and parameters p n = (Vii ■ ■ ■ iVk) is 
relatively sparse if for each £ and associated generating class (£j, kj)j & j: 

n A ''E pn (S?) ' 

To make this concrete, consider our example with triangles and links. The only 
generating classes to check in the condition are those of triangles. A triangle can 
be generated by other combinations of links and triangles. The expected number of 
triangles that nature generates directly is EpyfSy] = Pt\Z) an d the number of links 

not in triangles is (approximately) Ep^S*]?] = pi (Y™) ~ O (^ r (s)))' ^ nus ^ mus t be 
that for each generating class, 

I\j^ P "(Sg) 

« ► 0. 

Pttv* 

36 Note that Mj > 1 since J > 2 and each set of kj nodes intersects with at least one other set of kf 
nodes for some j' j. 
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(c) Links form (d) Resulting network 

Figure 3. The network that is formed on n nodes is shown in panel D. 
First, triangles form independently with probability pj,. Next, links form 
independently with probability p\ on the remaining part of the graph. In 
panel C we see one incidence of an extra triangle that has been generated 
by this process. 



For the generating class of all triangles, this implies that p\n 3 — > 0, so px = o(n~ 3 ^ 2 ). 
For the generating class of all links, this implies that 37 p\jpr 0, which is the obvious 
condition that triangles formed by independent links are rare compared to triangles 
formed directly. This implies that (but is not necessarily implied by) Pl = o(n~ l l 2 ). 
The remaining generating class combinations are then implied by these ones. 

For example, letting = a{n)/n 2 and pi = b/n, where a(n) = o(n 1,/2 ) satisfies the 
sparsity conditions and leads to an expected degree of b + a(n)/3 and an average clus- 
tering of roughly 6 ( b+a ( n )/ 3 °(b+ a ( n )/ 3+1 ) • This can be consistent with various clustering 
rates, and admits rates of links and triangles found various observed networks. 38 

5.3. Estimation of Sparse Models. 

S n denotes the actual number of subnetworks of various types that are truly gener- 
ated, which are not observed by the observer since the resulting g may include inciden- 
tally generated subnetworks. Let S n (g) the observed counts including the incidentally 

37 Given that p T = o(rt~ 3/2 ), it follows that E PL [Sj?] = p L ((' 2 l ) - O (prQ))) is proportional to p L n 2 . 
38 To match very high clustering rates the model can be altered to include cliques of larger sizes. 
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generated subnetworks. So, in Figure 3, = 9 but S n (g) = 10 and from observing 
g there is no way to know exactly what S% is. Nonetheless, as we prove, under the 
sparsity condition we can accurately estimate the true statistics, and thus the true 
parameters under the sparsity conditions. 

Before stating our next result, we define the following notation. 

Let S £ (g) be the maximum count of S 1 ™ that is possible on network g. So, for 
instance if we are counting triangles and links not in triangles, then S^(g) = and 

S n L {g) = f") — Lx{g) where Lx{g) is the number of links in triangles in g. 39 



Let 



Pl(g) 



s?(g) 



and let D n = Diag {p£(#)n m *}J =1 . 

So, Pi{g) is the fraction of possible subgraphs counted by S 1 ™ that are observed 
in g, out of all of those that could possible exist in g. This is a direct estimate of 
the parameter p™, as if these subgraphs were each independently generated and not 
incidentally generated. D n is a normalizing matrix. 

In order to have Pj{g) be accurate in the limit two things must be true. First, 
the network is relatively sparse, which limits the number of incidentally generated 
subgraphs. And, second, it must be that the potential number of observations of a 
particular kind of subgraph grows as n grows. This would happen automatically in a 
sparse network setting if we were simply counting triangles and links not in triangles. 
However, if nodes have different characteristics (say some demographics), and we are 
counting triangles and links by node types, then it will also have to be that the number 
of node with each demographics grows as n grows. If there are never more than 20 
nodes with some demographic, then we will never have an accurate estimate of link 
formation among those nodes. 

So, we say that a SUGM model is growing, if the probability that Sf(g) — > oo for 
each £ goes to 1. 

Theorem 2. Consider a sequence of growing subgraph models as defined in Section 5.1 
that are relatively sparse and have associated nicely- ordered subgraph statistics S n = 
(Si,...,S%) and parameters p n = (p™, ...,p^). The estimator is ratio consistent: 40 

P -> 1 for each i. Moreover. , 41 



Theorem 2 illustrates that nicely-ordered and relatively sparse SUGMs are consis- 
tently estimable via easy estimation techniques: ones that are direct and so trivially 
computable. 



39 In sparse networks, ir(ff) would be vanishing relative to (™) and so could be ignored. And typically, 
in sparse networks, S n L {g) will be well approximated by ye(^ e ), where yi is the number of possible 
different subgraphs of type £ that can be placed on rri£ nodes (e.g., yi is 1 for a triangle, m for a star 
on m links, etc.). 

40 This, of course, implies consistency. But given that the parameters might be converging to 0, the 
ratio version is important. 
1 J is the fc-dimensional identity matrix. 
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The proof of the theorem involves showing that under the nicely-ordered and sparsity 
conditions the fraction of incidentally generated subnetworks vanishes for each £, and 
so the observed counts of subnetworks converge to the true ones. Given that these are 
essentially binomial counts, if the rate at which the incidental counts are generated 
are limited sufficiently, then as the second part of the theorem states, a variation on 
a central limit theorem applies, and then normalized errors in parameter estimation 
are normally distributed, and we know the rates at which the parameters converge 
to their limits. For inference and tests of significance for single parameter values we 
note that analytic estimates of the variances are directly computable from the analytic 
expression of the diagonal of the variance matrix. Of course, more complex inferential 
procedures and tests can be executed through a simple parametric bootstrap as the 
model is easily simulated. 

To make the convergence rates concrete, take our example with links and triangles, 
letting pt = a/n 2 and pl = b/n. (This will not be a tight bound, but an example of a 
realistic level of sparsity that satisfies our conditions for asymptotic normality.) Then 
one can check the incidental generations for triangles is o p (n 1 / 2 ), which means that the 
fraction of incidentally generated triangles is o p (n -1 / 2 ). 

Again, we emphasize that although the estimator here is based on binomial approx- 
imations, a SUGM still incorporates interdependencies directly through the subgraphs 
that are generated. The results make use of the fact that in sparse settings, the picture 
of interdependencies is clear and are measured by the statistics one-by-one. 

6. Simulation Exercises 

We numerically illustrate the SUGM approach in several ways. We conduct two 
exercises using data generated under our model. In both exercises, we use the same 
class of models: a nicely-ordered sparse SUGM with triangles, links and (possibly) 
groups. 

6.1. Counting Triangles. In the first exercise, we compare estimation of a SUGM 
to an alternative that does not directly account for triangles, but only captures them 
indirectly through covariates. That is, the alternative model is one that only accounts 
for links on a bilateral basis, and triangles form because three individuals were all very 
close to each other in their characteristics (e.g., location, age, etc.) and so happen to 
form a triangle. 

This exercise illustrates an important bias that emerges if one does not properly 
account for direct triangle formation: homophily effects are dampened since cross 
group links are sometimes needed to form appropriate triangles. 

This simulation is based on a model in which every node belongs to one of two 
groups. Due to the presence of homophily, a node is more likely to form a triangle with 
two other nodes of the same group than to form a triangle where there is heterogeneity 
in group memberships. Similarly, a node is more likely to form a link with another 
node of the same group. 

Every triple of nodes is designated as either "within" or "across," where a within- 
triple is such that the three nodes are in the same group, and across is any other triple. 
Similarly, every pair of nodes is characterized either as "within" or "across". 
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The nicely-ordered sparse SUGM is as follows. In the first step, triangles are formed 
independently with probability p^ = 20/n 2 for within-triples and p^ = 2/n 2 for across- 
triples. Next, links are formed with probabilities p^ = 4/n 1A and pi = l/(4n 1 ' 1 ) for 
within and across links, respectively. As such, the sparse SUGM is characterized by 
P = (PljPliPtiPt) ( or correspondingly (3k = log f° r a SERGM form). 

We report results from two estimation procedures. In the first procedure, we directly 
estimate p^ and p^ as the share of observed within and across triangles that have 
formed, respectively, and then estimate p% and p\ as the share of observed remaining 
within and across links, respectively. Our theoretical results have demonstrated that 
such an estimation technique is asymptotically consistent. (We also compute corre- 
sponding /3's.) 

In our second estimation procedure, we misspecify the model to capture the inter- 
dependency through observed homophily alone. In this case, link formation as comes 
from 9 = {pliPIl)-! alone. 

In our simulations we set n = 500 and use 100 simulations. Results of the parameter 
estimates are presented in Table 2. 



Model Parameter Estimates 



Panel A: Estimation Results for p 
Pi 

Truth 0.0043 
SubGM 0.004 
Homophily Only 0.0173 


P2 
0.0003 
0.0003 
0.0041 


P3 
0.0001 
0.0001 


p4 
0.000008 
7.7907E-06 


Absolute Difference Between Estimated to Truth (SubGM) 
Absolute Difference Between Estimated to Truth (Homophily) 


0.0003 
0.013 


3.0591E-06 
0.0038 


0.000024854 


2.0932E-07 


Ratio of Estimated to Truth (SubGM) 
Ratio of Estimated to Truth (Homophily) 


0.9358 
4.0218 


0.9886 
15.1618 


0.6893 


0.9738 


Panel B: Estimation Results for (3 

p, 

Truth -5.4455 
SubGM -5.5121 
Homophily Only -4.0406 


P 2 
-8.2221 
-8.2336 
-5.4995 


P 3 
-9.4334 
-9.8055 


Pt 
-11.7361 
-11.7626 


Absolute Difference Between Estimated to Truth (SubGM) 
Absolute Difference Between Estimated to Truth (Homophily) 


0.0666 
1.4048 


0.0115 
2.7226 


0.3721 


0.0265 


Ratio of Estimated to Truth (SubGM) 
Ratio of Estimated to Truth (Homophily) 


1.0122 
0.742 


1.0014 
0.6689 


1.0394 


1.0023 



Noles: Estimation as described in [he text, with 100 simulation repetitions performed. 



Table 2 

As expected, we find that the correctly specified SUGM generates accurate results 
while the misspecified homophily model induces biases. Specifically, the ratio of the 
within to across linking probabilities is underestimated. 

6.2. Matching Features in Data. We next examine how well a sparse SUGM with 
links, triangles, and node characteristics (including geography) does at matching other 
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features of the network in actual data, when those other features are not directly 
measured in the model. That is, if the model only uses triangles and links, how 
well will it reproduce other features of the networks such as component structure, 
degree distributions, and various eigenvalue properties of adjacency matrices. For our 
exercises we make use of Banerjee et al. (2012) data-set consisting of networks in 75 
Indian villages. Here we focus on two sorts of networks: risk-sharing networks and 
friendship networks. 

In addition to the average degree and clustering (which will be at least partly cap- 
tured by links and triangles), we are interested in other quantities motivated by theory. 
We look at the first eigenvalue of the adjacency matrix, which is a measure of diffusive- 
ness of a network under a percolation process (e.g., Bollobas et al., Jackson (2008)). 
We are also interested in the second eigenvalue of the stochasticized adjacency matrix. 
This is a quantity that is key in local average learning processes and modulates the 
time to consensus (Golub and Jackson (2012)). Finally, we look at the fraction of 
nodes that belong to the giant component of the network, as empirical networks are 
often not completely connected. 

Our procedure is as follows. For every village, for each type of network (risk-sharing 
or friendship), we estimate each of two network formation models. The first network 
formation model is a simple sparse SUGM with triangles, links, and geographic covari- 
ates (a dummy for being within the median distance in the village). The second network 
formation model is a simple conditional edge independent link formation model where 
the probabilities also depend on the geographic covariates. Given our estimated pa- 
rameters for the village network, from each model generate 100 draws from the SUGM 
model and the homophily model. We then study the properties of these generated 
networks across the aforementioned network characteristics: degree, clustering, frac- 
tion of nodes in the giant component, maximal eigenvalue, second eigenvalue of the 
stochasticized matrix. 

Table 3 presents the results of our simulation exercise. We show that networks 
simulated from our estimated parameter values according to the sparse SUGM better 
display the structural properties exhibited by the empirical Indian village networks 
than those simulated from a homophily-based model. 

Panel A presents results for the risk-sharing networks and Panel B shows results for 
the friendship networks. Both the sparse SUGM and the homophily models do quite 
well for average degree. As expected, the homophily model does extremely poorly when 
it comes to clustering while the SUGM does quite well. 

Interestingly, conditioning on the triangles in the SUGM is enough to deliver matches 
on several other dimensions. Let us focus on the risk-sharing networks for exposition. 
The fraction of nodes in the giant component is very close between the empirical 
data and the SUGM (77% to 74%), as is the first eigenvalue (5.23 to 5.26), as well 
as the second eigenvalue (0.94 to 0.95). Meanwhile, the homophily model performs 
rather poorly: overestimating the giant component (88%), underestimating the first 
eigenvalue (3.9) as well as the second eigenvalue (0.28). These patterns hold for the 
friendship networks as well. 

Ultimately, this suggests that there is substantial value added simply by modeling 
the formation of a sparse number of triangles. 
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Properties of Networks 





Panel A- Risk sharin networks 








Truth 


SubGM 


Homophily Model 




2 695 


2.524 


2 5353 


Average Clustering 


0.1296 


0.1131 


0.0107 


Fraetion in Giant Component 


0.7729 


0.7403 


0.8771 


First Eigenvalue 


5.2329 


5.2555 


3.9019 


Second Eigenvalue of Stochastized Matrix 


0.9422 


0.9506 


0.2841 




Panel B: Friendship networks 








Truth 


SubGM 


Homophily Model 


Average Degree 


3.3337 


3.4109 


3.3864 


Average Clustering 


0.1073 


0.1295 


0.0189 


Fraction in Giant Component 


0.8131 


0.8343 


0.9599 


First Eigenvalue 


6.5738 


6.3715 


4.6249 


Second Eigenvalue of Stochastized Matrix 


0.9193 


0.9153 


0.2792 



Notes: Estimation as described in the text, with 100 simulation repetitions performed. 



Table 3 




123456789 10 123456789 10 

Degree Clustering x 10 



(a) Degree distributions (b) Clustering Coefficients 

Figure 4. Examples of degree and clustering distributions. 

7. Extensions and Applications 

We now describe a few special cases of SERGM and SUGM models that illustrate 
their use in analyzing strategic network formation. We also apply the model to data. 

7.1. Strategic Network Formation. As mentioned in Section 2.6, an important 
use of SERGMs (and SUGMs) is for estimating strategic network formation models. 
While a full treatise on this subject would take us away from the focus of this paper, 
we illustrate how the models can be used in some contexts, with extensions to other 
contexts left to the reader and further research. 



7.1.1. Mutual Consent Formation Models. Here we describe a strategic network model 
that harnesses some of the power of Theorem 2. The key aspect of the model is 
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that decisions to link are not only bilateral but instead multilateral: sub-groups of 
individuals decide whether or not to form cliques. A pair of individuals may meet and 
decide (mutually) as to whether to add a single link, but also a group of three (or 
more) may meet and decide whether to form a clique. Moreover, the probability that 
they form the clique could depend upon the characteristics of the individuals involved. 

Consider cliques CL m C {1, . . . , n} of m nodes with characteristics X = (Xi) ie ci m 
meeting at random. There are certain aspects, hi(X), that affect i's benefits from the 
clique. 42 

There is a probability 7t m (X) that a clique of m individuals with characteristics X 
meets. So it might be more likely that individuals of similar ages meet than ones with 
different ages. 

Individual i obtains a utility 

Ui,Cl m — Jm,Xihi(X) — 6i£l m 

from the formation of a given clique Cl m , where 7 mj X; depends on the size of the clique 
and possibly on the characteristics of i and €i t ci m is a random idiosyncratic term. 

The clique then forms conditional upon it having met if and only if Ui } ci m > 
for all i (say with at least one strictly positive). 43 If the error term has an atomless 
distribution, then the strictness is inconsequential. 

Let F m (-) be the distribution of error terms when the clique is of size m. 

So, the probability that a clique Cl m of size m with characteristics X = (Xi) ie ci m is 
formed is 

Pm,X = K m (X) Y[ F m {j m> Xihi{X)) . 

Suppose that the largest cliques that can form are of some size M. Let S mj x{g) 
denote the number of cliques of size m 6 {2, . . . , M} consisting of individuals with 
characteristics X that form in a network g, counted excluding cliques that are already 
a subset of some m! > m. M Under the conditions of Theorem 2, the SUGM in (7.1.1) 
is then easily estimated and consistent. 45 



For instance if X were a list of the individuals' ages, then it might be that i's benefit from the 
clique is a function of i's distance from the average characteristics: hi(X) = Xi — Xj/(m — 1) . 

It could also be that i benefits from the maximum value of X_i, or suffers from variation in the 
characteristics, h can be tailored to the specific application and list characteristics. 
43 This then corresponds nests pairwise stability as defined by Jackson and Wolinsky (1996), subject 
to the meeting process. One can adjust this to take into account other rules for group formation. 
44 For instance if there is a group of four identical individuals in a clique, then we will not also count 
cliques of size three that are a subset of that, but only count cliques that are not already subset of 
some larger clique. 

Although the probabilities of various cliques are directly estimable (and hence identified) under 
the conditions of Theorem 2, of course whether the various parts of K m (X)Y\ it - cl F m (j m hi(X)) 
are well identified depends on the specifics of the m and the functional forms involved. Just as an 
example, consider a situation with two types. Set Xi = 1 if i is of type 1 and Xi = 2 if i is of 
type 2 and hi(X) = Xi — Xj and consider m — 2. Then 72 and —72 both lead to the same value 
of F2 (j2(Xi — Xj)) x F2 (72 (Xj — Xi)). So here it could not be judged whether the type 2 has a 
greater expected utility (net of the random term) from the match than the type 1 or whether it is 
the reverse. There are some obvious simplified formulations that allow for identification, for example 
setting instead hi{X) = \Xi — Xj\. It might also require specifying a (nonlinear) functional form for 
7r m as in Currarini et al. (2010). 
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Note that the above formulation centers on cliques, but substituting H m to be some 
particular subgraph instead of the clique, and keeping track of each node's position in 
the subgraph. 

It is important to note that such a formulation allows us to do welfare computations 
and changes in welfare due to changes in, say, the distribution of X if they include 
parameters that a policymaker may control - or it may be that a policymaker could 
change the it function in some well-defined way by say, subsidizing interactions among 
groups with certain sorts of characteristics who might tend to meet infrequently. 

7.1.2. Search Intensity Models. Although the above model allows for strategic forma- 
tion, it is a model where there is no direct strategic interaction between agent's decisions 
to form subnetworks. Agents effectively have dominant strategies to try to form all of 
the cliques that benefit them, and the ones that form depend on the actions of others. 

To account more directly for externalities that are present in network formation pro- 
cesses, we can also include search intensities as have been analyzed in various formation 
models such as Currarini et al. (2009), Currarini et al. (2010), and Golub and Livne 
(2010). Those models are of bilateral (independent) link formation; but are easily 
extended more general SUGMs as we briefly describe. 

Each agent i with characteristics Xi puts in a search effort e(Xj, m, X) G [0, 1] to form 
cliques of size m with characteristics X. m = 2 indicates links, and so e(Xj, 2, (Xi, Xj)) 
is the effort that agent i expends in trying to form links with agents who have char- 
acteristics Xj; and e(Xi,3, (Xi, Xj, Xh)) is the effort that agent i expends in trying 
to form triangles where the other two agents have characteristics Xj and X^, and so 



"Effort" is simply a shorthand for either the time spent socializing in various ways, 
or else it could simply indicate a relative openness to forming relationships of various 
types. 

An agent obtains a utility 



from being of type Xi in a clique of size m with characteristics X. 

The probability that a given clique Cl m forms depends on the vector of efforts for such 
cliques of those in the clique, (ej(, m, X))j & ci m , according to a function 7i mt x((ej(m, X))j g cz m ) 
that is nondecreasing in each of its arguments. 

An agent also pays a cost of network formation: c(Xj, (ej(m, X)) m X ) that depends 
on his or her characteristics Xj and the search efforts that he or she exerts in forming 
various links and cliques, (ej(, m, X)) m ^ X - 

Thus, an agent i's overall payoff as a function of the all of the agents' efforts is 
described by 



In a case where the w's are nonnegative, this defines a supermodular game: agent i's 
change in payoff from increasing any dimension of (ej(m,X)) m x is nondecreasing in 
the vector of strategies (ej(m, X))j^ mt x- In such games, pure strategy equilibria exist 
and form a complete lattice (e.g., see Topkis (2001)). Additional conditions on ir, u, 



forth. 



u(Xi,m,X) 
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and c can ensure uniqueness of equilibrium, depending on the specific functional forms 
that are used to parameterize the model, or one can appeal to equilibrium selection. 46 
Models of this structure thus define SUGMs, where the relative frequencies p m ,x of 
cliques of size m consisting of agents with characteristics described by the profile X. 
Specifying functional forms for n, u and c then allows for estimation of parameters 
of the model and of the equilibrium, provided the specification is tight enough to be 
well-identified. 

Again, although the above formulation is described for cliques, it is easily adjusted 
for any subgraphs (for instance an agent may value being at the center of a star with 
m agents, etc.). In the obvious extension one needs to keep track of the positions of 
the various types of agents in the subgraph as there are then asymmetries in positions 
and, for instance, agents might care about the characteristics of the agent at the center 
of a star. 

7.1.3. Other Formulations. 

The above formulations allow for various cliques to form, and thus nontrivial clus- 
tering in networks, as well as high frequencies of friends in common even in sparse 
networks. They are also extend directly to other sorts of subgraphs: stars, lines and 
other configurations. 

Another feature of interest is that an agent may care about friends of friends: and 
may care to link to others who have more neighbors. For example, that is a central 
feature in the connections model of Jackson and Wolinsky (1996), and will be a feature 
of settings where agents care substantially about indirect communication and favors, 
etc. 

Such a formulation can be viewed as an ERGM, as shown by Mele (2011). Specif- 
ically, in his model in every period a pair of individuals is called to meet and then 
the agents decide whether or not they want to link. Their utility from forming the 
link may depend on the network that has formed through that stage. Mele shows that 
in a model where agents value links and the indirect connections that they convey in 
an additively separable form, the invariant distribution of such a process results in an 
ERGM. 47,48 In particular, his model specification leads to a setting where the relevant 
ERGM depends on links and two-stars (so friends of friends). By applying our results, 
such a model can then be estimated directly and easily as a (sparse) SUGM. 

Another novel approach to estimating strategic network formation models is to work 
with ones that can be projected onto subnetworks, as by Sheng (2012). Building on 
techniques from estimating bounds from games with multiple equilibria as in Ciliberto 
and Tamer (2009), Sheng derives bounds on what sorts of subnetworks should be 
observed under pairwise stability. Ultimately, her approach could be used to produce 
a SUGM which then could be estimated using our techniques. 



Here there are positive spillovers/externalities from strategics, and so generally the maximal equi- 
librium will Pareto dominate the others, and so a standard refinement would be to look at the Pareto 
efficient equilibrium which is then unique and pure (e.g., see Vives (2007) for some background). 
47 Thcre must be some ERGM representation of the invariant distribution by the Hammcrslcy-Clifford 
theorem. The Mele result shows that the potential function associated with the game essentially defines 
the corresponding ERGM. 

48 For an alternative specification in terms of a Bayesian equilibrium, see Leung (2012). 
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7.2. Noise and Almost-Cliques. 

Although observed social networks often exhibit significant clustering and numbers of 
triangles significantly above what would be observed with independent link formation, 
typically full cliques of larger sizes are rarer. There can be different reasons for this, 
and here we briefly discuss a correction for this. 

One reason for a failure to see completed cliques of higher level is that small amounts 
of measure error makes them exponentially (to the square of the size of the clique) less 
likely to be observed. To see this point, consider the simplest situation such that there 
is a probability e > that a data set fails to exhibit any given link (independently 
across links). A clique of £ nodes has £(£ — l)/2 possible links. If it were present, then 
it would be fully observed only with a probability that all of its links are observed, or 
(1 — s)^ -1 )/ 2 . For example, suppose that e — .1. Then the probability that a clique of 
size 3 is observed is .73, while this drops to .53 for a clique of size 4, to .35 for a clique 
of size 5, and to .21 for a clique of size 6. 

Why is this an important issue? Suppose that the true model generates cliques of 
size 3 and 4 in addition to links. The 27 percent of triangles that are missed due to 
measurement error, will end up classified as two or fewer links, thus biasing downwards 
the parameter on triangles and upwards that on links. The 47 percent of cliques of 
size 4 that are missed, will generally contribute to increased observations of triangles 
and links. For instance, deleting one link from a clique of size 4 makes it appear as 
two cliques of size three. Taking two links out makes it either lead to a triangle plus 
a link or four extra links. This can substantially bias upwards the counts of triangles 
and links. 

There are two ways to deal with this, one more precise and one easier but less precise. 
The precise way to do this, is to model the error and include it in the specification of 
a SERGM or SUGM. For example, given a SUGM with links, triangles and cliques of 
size 4 together with probability e of missing a link in the data, then the probability of 
seeing a clique of size 4 becomes p 4 (l — e) 6 . The probability of seeing a clique of size 4 
less one link (under the sparsity conditions) becomes ^46(1 — e) 5 e, and the probability 
of seeing a clique of size 4 less two links (again under the sparsity conditions) becomes 
p 4 15(l — e) 4 e 2 , and so forth. Thus, one can then consider the following count statistics: 
links, two-stars, triangles, cliques of size 4 less two links, cliques of size 4 less one link, 
and cliques of size 4, etc. Each of these has a well defined probability given the original 
specified P2, P3, Pi and the e. One can then estimate these parameters (including the 
e) using the SUGM techniques from our theorem. 

A less precise way to do correct for incomplete cliques would be to simply count any 
structure on £ nodes that has at least x^£{£— 1)/2 links as a clique of size £, where < 1 
is some simple adjustment factor. While a crude adjustment, this could still improve 
over ignoring the issue altogether. For example, in a setting with links, triangles and 
4-cliques, one could make a simple adjustment by counting any configuration on four 
nodes with at least 4 links as a 4-clique, but then stick with counting triangles only if 
all three links are present and they are not part of any modified 4 clique. 

While measurement error is one possible way in which a clique could form with miss- 
ing links, another possibility is that the formation process is one where agents choose 
which relationships to form, but have payoffs to cliques that allow them to benefit 
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from fewer than all links being present, and allowing some noise in the formation (ei- 
ther utility-based or trembles) that lead to some links in cliques not forming. While 
different in interpretation, the techniques that account for such missing links would 
be similar to that described above: simply directly accounting for the probabilities 
that various subgraphs form. As SERGMs and SUGMs can already admit arbitrary 
subgraphs as statistics, this would simply place additional restrictions on the relation- 
ships between the probabilities of various subgraphs, but would otherwise use similar 
estimation techniques as outlined in our results. 

7.3. The Use of SERGMs and SUGMs in Policy Analysis. One motivation 
for modeling network formation is so that we can also understand what might lead 
to changes in network structures. This is particularly important in situations where 
network features are drivers of economic behaviors: for instance, expansion properties 
of a network can impact diffusion of innovations (e.g., it might be the case that in 
networks with a higher maximal eigenvalue farmers are significantly more likely to 
have learned how to properly use a new farming technology that has been introduced). 
By estimating a network formation model (a SERGM or SUGM), we can understand 
what sorts of interventions could generate variation in network features of interest. 49 
We might find a certain sort of intervention (e.g., something to enhance the meeting 
process, or programs to encourage relationships across groups) would significantly affect 
homophily, expansion properties of the network, or some other feature of interest. This 
can then lead to changes in diffusion of information or other behaviors. 

7.4. Multigraphs and motives for linking. An important question in analyzing 
economic and social networks is what sorts of interactions exist between different sorts 
of relationships: does the fact that two agents borrow and lend money to each other 
make it more likely that they will also seek advice from each other, or engage in 
other sorts of relationships? Is it more likely that pairs of agents who have multiple 
relationships with each other will be embedded in cliques? 

The SERGM and SUGM models that we described extend directly to multigraph 
settings in obvious ways, as we illustrate in the application below. 

7.4.1. Empirical Application: Motives for linking. The observation that different re- 
lationships that individuals may have with each other may be interdependent dates 
back to Simmel (1918). Here, we outline three basic theories for multiplexing and then 
investigate them in some data using our results. 

First, there may be a fixed cost of making links. Conditional on having established 
a relationship with another person, it is cheaper to construct the second relationship. 
That is, if i is willing to link to j to borrow rice, then it might be cheaper to for i to 
also establish a social link with j. 

Second, decisions of individuals to form relationships depend on their characteristics. 
To the extent that some of those characteristics are not observed in a data set, their 
decisions to form relationships could be correlated even after adjusting for all observable 
characteristics. 

Third, is what we call a support theory based on favor exchange. In such a theory 
there are two ways to support the exchange of favors. One is to have frequent enough 



Thanks to Esther Duflo for emphasizing the importance of this use of the models. 
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favor exchange so that it is in each individual's interest to provide favors whenever his 
or her friend asks for one. In this way, an isolated pair of individuals who have multi- 
ple relationships (exchanging various types of favors, etc.) can sustain favor exchange 
because the multiple relationships make the pairing valuable enough to continue ex- 
changing favors. Another way to provide incentives to exchange favors is to embed 
that exchange in a clique. If some individual in a clique fails to provide a favor when 
asked, then all of the other individuals in the clique collectively punish the agent by not 
providing favors in exchange. The embedding in a clique provides the additional incen- 
tives necessary to motivate favor provision. Thus, the leverage needed to incentivize 
the provision of favors comes either through building valuable multi-level bilateral re- 
lationships, or else by embedding the exchange within cliques with more individuals. 
It should be rarer to see isolated favor exchange between two individuals who do not 
share multiple relationships and do not have friends in common. 50 

To examine these theories, we use the framework outlined in Section 5.1. We use 
the data collected across 75 Indian villages collected by Banerjee et al. (2012). 51 The 
specific usefulness of the data is that it includes multigraphs including various forms of 
favor exchange links (whether households borrow or lend kerosene, rice, money,...) and 
social links (whether household members visit each other socially, go to temple together, 
...). For the purposes of our analysis, we use all 75 Indian villages and construct 
networks at the household level. We build undirected, unweighted multigraphs with 
two link types (where a link of a given type is present if either household claimed that 
type of relationship with the other) . Favor exchange links indicate whether households 
borrow/lend kerosene, rice and other material goods from/to each other and social 
links indicate whether members of a household visit members of another household or 
receive these members as guests. 

To distinguish between the first two theories, we examine multiplexing of supported 
links (those in cliques) to those that are not supported. Under a theory of fixed 
costs of linking, the multiplexing would have no reason to vary, while under a theory of 
unobserved characteristics they would. In particular, unobserved characteristics should 
result in triangle formation among trios of nodes that have similar characteristics and 
then should also have high multiplexing. Unsupported links are more likely to have 
arisen spuriously and are thus more likely to be unsupported. The evidence is consistent 
with the similar characteristics theory but not with fixed costs, as more than 70 percent 
of supported links are multiplex, while less than 20 percent of unsupported links are 
multiplex, as described in Table 4. 52 



For more on this, see Jackson et al. (2012). 
51 For more background on the data see Banerjee et al. (2012), and for more on favor exchange in the 
data see Jackson et al. (2012). 

52 The entries of Table 4 are constructed as follows. In columns 1-2 we consider either the favor 
graph or the social graph and construct the share of links in a village that are supported (members 
of a triangle). We then compute the mean and standard error across the 75 independent villages. 
In columns 3-4 we consider either the favor graph or the social graph and construct the share of 
unsupported links that are multiplexed within the village. Again we present the mean and standard 
error across the 75 villages. In columns 7-8 we consider the share of total links of a given type that 
are either supported or multiplexed. Again we present the mean and standard error across the 75 
villages. Columns 3, 6, and 9 present differences between the favor graph and social graph shares with 
bootstrapped standard errors. 



TRACTABLE AND CONSISTENT RANDOM GRAPH MODELS 



34 



Support and Multiplexing 
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Multiplexed/Total 
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>r Multiple. 


ad/ Total 
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Social 


Difference 


Favor 


Social 


Difference 


Favor 


Social 


Difference 


0.7303 


0.6928 


0.0376 


0.1907 


0.1400 


0.0508 


0.9210 


0.8327 


0.0883 


(0.0113) 


(0.0111) 


(0.0026) 


(0.0085) 


(0.0066) 


(0.0028) 


(0.0044) 


(0.0062) 


(0.0042) 



Notes: Favor links are borrowing/1 end mil kerosene or rice. Social links are given bv ihose who visil one's house or [hose whose house the individual visits. In columns 1-2 we look at the 
share of a given link tvpe thai are supp< tried (as well as the difference in column 3). In columns 4-5 we look at the share of a given link type thai are unsupported but are multiplexed (as 
well as the difference in column 6). In columns 7-M we look at the share of a given link type that are eilher supported or multiplexed < as well as the difference m column '■>). Standard errors 
computed at the cross network level are presented in parantheses. 



Table 4 



Next, let us examine the support theory. Under that theory, favor exchange links 
should be more likely to either be embedded in triangles or multiplexed than social 
links that might not involve explicit favor exchange. Table 4 provides figures for both 
links that involve explicit favor exchange and those that are more social in nature. It 
shows that 73 percent of favor exchange links are supported (embedded in triangles) 
as compared to 69 percent of social links. While they are statistically significantly 
different, the 3.8 percent gap is small relative to the mean. Next we look at multi- 
plexing of unsupported links. This exercise of counting unsupported links separately 
is legitimized under our SUGM. We find that while nearly 19 percent of unsupported 
favor exchange links are multiplexed, only about 14 percent of social links exhibit mul- 
tiplexing. This 5 percent difference is large relative to the mean. Putting this together, 
favor exchange links are significantly more likely to be reinforced in some way - either 
through multiplexing or support - as is consistent with the support theory. 

The analysis finds statistics that are consistent with both the similarity in character- 
istics theory and the support theory, and not with a simple fixed cost theory. 53 Finally, 
we interpret the fact that favor exchange links are differentially more likely to be mul- 
tiplexed than social links to be suggestive evidence in favor of the support theory. The 
identifying assumption here is that any reasonable story of correlated characteristics 
driving such patterns which are different from favor exchange ought to be symmetric 
in their effect on multiplexing probabilities, whereas support theory suggests that only 
unsupported favor links ought to be more likely to be multiplexed. 

8. Conclusion 

We presented two new classes of models of network formation, SERGMs and SUGMs, 
based on the idea that network formation is driven by the properties of a network or by 
the formation of various subgraphs. This turns the focus away from the network as the 
unit of analysis, and instead focuses on its properties, subgraphs and statistics. This 
perspective allowed us to develop simple and direct estimation techniques and to derive 
results concerning consistency and asymptotic distributions of the estimators. Given 
the growing literature on estimation of network formation, such results are essential. 



It could be that there are more complex fixed costs that depend on subgraph configurations, but 
costs for second relationships would have to be significantly lower in cliques than in isolation. 
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The approach we have taken, can certainly be extended to other formulations. We 
have discussed the incorporation of several classes of models that allow for random 
meetings, search, and strategic elements of network formation. Given the numerous 
theories underlying social structure and network formation, more such models will 
certainly be needed and the basic methodology developed here should be helpful along 
that road. 
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Appendix A. Proofs 

Proof of Proposition 1. Recall that the MLE f3 n (s) is the (3 that solves 

s = E (S [S n ]. 
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Thus, since concentration implies that 

it follows that 

C n (E Hsn) [S n )-Ep n [S n })-^0. 
Given that expectations identification implies that 

^(E^l-E^DI > 7 |/3-/3"| 

for all n, it follows that 

\f3 n {S n ) - p n \^o 

as claimed. ■ 

These results can be rephrased in terms of standard properties of extremum estima- 
tors and identifiable uniqueness. 5 

Proposition A.l. If a sequence of SERGMs (S n , K$„, A n , (3 n ) satisfies concentration, 
rate- expectations identification, and the (3 n all lie in a compact set, then \(3 n (S n ) — 

Proof of Proposition A.l. Let 

m n (0) :=C n (S n -Ep [S]). 

The objective function is Q n {P) '■= mn( y /3)''m n (f3). 

First, we want to show that the moment function satisfies a uniform law of large 
numbers: sup^gg \\m n = o p (1). By concentration, we have pointwise convergence 
of m n ((3) to zero in probability. Therefore, we need to only check stochastic equicon- 
tinuity: that for every rj > there is a S > with 

P sup \m n (fi) - m n > 7] < rj. 
\\\l3-n<s J 

A sufficient condition (Andrews, 1994) is if a Holder condition is satisfied: 

|m n ((3)-m n (0') | <X n - \\I3-I3'\\ 

where X n is some O p (1) random variable. This is directly guaranteed by rate-expectations 
identification with X n = 7#. 

Second, one can check that expectations identifiability guarantees identifiable unique- 
ness. Together with compactness of £>, this implies the above implies that /3 is consistent 
for/3 n . ■ 

Proof of Proposition 2. Recall that the MLE (3 n (s) is the (3 that solves 

s = Ms n }. 

Given Proposition 1, we need only show that if consistency holds then concentration 
must also hold. 

54 Following, e.g., Gallant and White (1988), we say the sequence j3 n is identifiable unique on B if 

Urn inf inf Q ,n(/3") - QoM > 0- 

n->oo /3£Bpn (e) 
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Given that rate-expectations identification implies that 

a n \EpiS n ]-Ep4S n }\< lH \(3-(3 \ 
for all n, it follows that if if consistency holds so that 



:!cS 



\fi n {S n ) - T 



0. 



then it must be that 
This implies that 
which implies concentration. 



a-lE^S-J-E^ll-^O. 



a n \S n -Ep n [S r 



0. 



Proof of Theorem 1. 

Consider a sequence of simple count statistics S n = (S™, . . . , S%) whose £-th entry 
takes on nonnegative integer values with some maximum value S e — > oo, and the 
simple count SERGMs specified with K n (s) = Ue (J)- 



(A.l) P,(S n = s) 

We rewrite (A.l) as 

Pp (S n = s) 



Ue ( s s e e ] 


exp (/3 n ■ s) 






3D ex p & n 


■s>) 




\(s" 


) exp (ffls e ) 




Es'eA" 


Ue 


f|) exp {Pp' e ) 



If we consider the distribution conditional on S n lying in 

B n = Hi L^- [S?] (1 - s)\ , . . . , [E P n [5?] (1 + e) J } C A n 

i 

(for large enough n under the not conflicted condition), then we can write 
Pp(S n = s\s e B n ) 



Ue 


[(!,') 


exp (p?s t ) 




Ue 


EsJeBj? 


(!) 


exp (ffis't) 



or 



(A.2) 



Pp(S n = s\s e B r 



n 



Consider a binomial distribution that has p" 



(g) exp (ggg) 



with range from to Sg . 



The corresponding probability that Sf = se given se G Bf can be written as 

) exp (g^ 
E^bj. (5) exp (/?? 4) ' 
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Taking the product of independent binomials, 

(5) exp ifiiat) 



P (S n = s\s G B n ) = n 



For a binomial distribution, the probability that Sg G 5™ — > 1. Thus, under in- 
dependent binomial distributions, the probability that 5* n G £? n — > 1, and so the 
unconditional probability for any s G -B n converges to 

(a.3) P =n -£te^-. 

(A.3) together with (A. 2) then implies that the distribution of the SERGM is approx- 
imated by the corresponding product of independent binomials. 

The remainder of the claimed results then follows easily from standard properties 
of the binomial distribution. The Vgg terms are computed observing the fact that for 
g (x) the logistic function, 

exp(p) 
(l + exp(p)) 

and one can apply the delta method to obtain the desired result. ■ 



Proof of Theorem 2. 

We begin the proof by showing the following. For any £, the fraction of counts of 
subnetworks I generated incidentally by some other subnetworks goes to 0. That is, 
consider some t and gg G Gg on me nodes. Let 

g, _ E(S?) 

This is no more than p™, as the denominator includes all possible subgraphs of size I 
(where yf is the number of subgraphs of type I that can be formed on any given 
nodes). 55 Let us consider the probability Zg that gi is incidentally generated by other 
subnetworks. 

We show that z™/p2 — > 0, which implies that z^/p™ — > 0. 

Consider gg G Gg and an incidentally generating subclass (£j, kj)j € j. 

We show that the probability z™(J) that it is generated by this subclass goes to zero 
relative to ffi, so that Zg(J)/pg — > for each J, and since there are at most Mg < k mi 
such generating classes, this implies that zf /pg — > 0. 



Here we provide the proof that applies without subgraphs being characteristic dependent. The 
extension to characteristic dependent subgraphs is straightforward simply by adjusting all numbers 
to reflect possible networks with given node characteristics as dependent on n and the sets of nodes 
that have particular characteristics as a function of n, but it is notationally much more intensive. 
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For a subnetwork in G?"., the probability of getting at least one such network that 
has the kj nodes out of the me in g$ is no more than 56 

Thus, 



jn e . 



Therefore 



p7 p7 



^(j) < n ieJ ^n^p7. 



Recall that Mj = J2jej kj — me and that M > 1 (since | J\ > 2 and each set of kj 
intersects with at least one other set of kj> for some j 1 ^ j). 
Therefore 

Z »(J) < n^gjgn^jl 

The numerator is of the order IIj e jE(S'™) while the denominator is of the order n Mj E(S™). 
Under the sparseness condition, 

n^E(Sf) ' 

and so we have verified the claim. _ 

S n (a) 

To finish the ratio consistency proof, note that the claim then implies that e qn — > 

1. Thus, dividing top and bottom by S^(g), it follows that \ ~~ ^ 1- Given 

/ (s) 

the growing condition and properties of the binomial distribution, it also follows that 
„ -> 1, and so -> 1. 

Next, note that the above also implies that the distribution D^/ 2 (pi(g), ...,p%(g)) 
converges to the distribution of D\l 2 (p%(g), ...,p%(g)) , where p^(^) = S™ /S e (g). 

The asymptotic normality of the (joint) distribution then follows from the usual 
Linderberg- Feller central limit theorem applied to triangular arrays of binomial random 
variables. This applies under the growing conditions of the theorem. 57 ■ 



This is a loose upper bound as it simply adds the probability that each possible one forms - but 
becomes more accurate as the probability of any one occurring vanishes. 

57 Let X\t, ...,Xtt be a triangular array of Ber(p-r) random variables and ptT — > oo (as under the 
growing condition). To apply the Lindcbcrg- Feller central limit theorem for triangular arrays, we 
check the Lindeberg condition: for any e > 0, 

T PT (L PT ) g E [ X ^{^ * ^Tpt(1-Pt)}} - o(l). 

The condition is implied by the fact that Tpx —> oo, and from the sparsity conditions which imply 
that pt is bounded away from 1. 
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B.l. Ratio Convergence Results. A sequence of SERGMs (S n , Kg n , A n , (3 n ) is ratio- 
expectations-identified with respect to a sequence of diagonal C n with positive diagonal 
entries if there exists 7 > such that 



r^n Tp r Or, 



- 1 



> 7 



Ph 



for all n and /i. 

A sequence of SERGMs (S n , Kg n , A n , (3 n ) is ratio-concentrated with respect to a 
sequence of diagonal C n with positive diagonal entries if 



fin on 
^hh J h 

^hh^Po [Si 







for each /z. 



Proposition B.l. If a sequence of SERGMs (S n ,Kg n ,A n ,P n ) is ratio- expectations- 
identified and ratio- concentrated with respect to a sequence of diagonal C n with positive 

e n {s n ) h p, 



diagonal entries, then 



1 /or eac/i h. 



1-^0 



Proof of Proposition B.l. Again, recalling that the MLE (3 n (s) is the (3 that solves 

s = E p [S n ]. 
Thus, since ratio-concentration implies that 

r~in on 

^hh^h -. P 

it follows that 

Given that ratio expectations identification implies that 

1 ChhEp[Sh) l\> j\ ^ h 



for all n, /i, it follows that 



or 

Hs n ) h p 

as claimed. ■ 
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B.2. Relationship with Binomial Models. 

We now discuss some simple conditions for consistency from the perspective of a 
binomial distribution. Consider the probability distribution defined over |o,...,X n | 

by 

p (Y \ K n (x) ■ exp Qg ■ x) 

^ P 

We can ask under what assumptions on {K n (■)} nen does j3 — > j3? It may seem odd 
first that a single draw along a sequence of distributions can generate a consistent esti- 
mation. However, we make a simple observation that we can transform this distribution 
into one that resembles a reweighted binomial distribution: 

( Xn ) -exp((3-x + X n (x)) 
(B.l) P fi (X n = x)= KxJ [ V { )} 



where X n (x) := log (w n (x)) := log (K n (x)/ [ 



' x "Yexp(/9-x) 



This comes from the fact that P« (X n = x) = ^ x ,-l N is the distribution 

corresponding to the probability distribution of a binomial, Bin (X n ;p = 1 ^^ j3 )- In 
turn, assumptions on X n (x) will allow for consistency of the MLE of (5. 

^ cx p Sn. 

Lemma B.l. (3 := 1+ap \ be the MLE in the above model. Assume co n (•) is such that 

Sri 

cov Bin(X n ; P ) ^ = Q f E Bin(x n ;p) ^ ^ 



^ P 

uniformly in a compact neighborhood of p. Then p — >p. 
Proof. First the normalizing constant can be written as 

(i-p) 

This follows from 



Sn 



' S n\ (, ( V \ , \ f \\ ( S n\ ( P 

exp log • s n + X n (s r 



and therefore 

5 (<) - " p ( los fe) ■ < + A » <•») = S (f:;) {ihf* «•> - 

Second, the maximum likelihood estimator solves the FOC of 

s n log (-?-) + log (1 - pf" - log (Ef " K (Ol) , 



Ef"K(4)] 
(l-p)*» 



V. 



given by 



= s 



p 1-p S n 



n p(l-p) n p(l-p) 1-p Ef«K«)] 
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The parameter p is therefore given implicitly by 

s n p Ef » K (<)] 1 



P = 



x 



Third, observe that 
which follows from 



S n Ef«K«)] 1-P 
1 



p(l-p) Ep 



Bin 



{4 -pSn}w n (s' n ) 



d P E? m k (oi = Err- d P {p s - (i - p) Sn ~ s ' n } >< ^ (O 



s' \ S ™ 



P( 



Thus 



p 



P 



{s' n ~ pS n } U n (S'J 



S n 



x 



1 — p 

Under the assumed condition that sup rf(:RrW cov „. ^ "'^(^"M) _ m the result fol- 
lows. ■ 



B.3. A Characterization of Consistency in terms of Variance. We say that a 
SERGM is expectations-identified if (3' ^ (3 implies that Ep[S] ^ Ep[S\. 

For simplicity, we present the univariate case, though the multivariate case follows 
simply by controlling the norm of the covariance matrix instead. 

Proposition B.2. Consider an expectations-identified SERGM. 
(1) For any e > such that e < var/3 ^ 



'e\ft-e,P+e] l> 



\P(S) -0\>e 



< 



e 2 vaip(S) ' 

(2) Conversely, let B > be such that var^S] < B and \S - Ep[S\\ < BEp[\S 
E/3'[S]\] for all (3' within m+TjB °f P- Then 



pn 
r a 



\P n (S) -0\> 



3(B + 1)B 



> 



2B - 1 



Consider a sequence of expectations-identified SERGMs indexed by the number of 
nodes n. We say that the sequence is consistent at j3 if there are sequences e n — > and 
r n — > oo such that 



r /3 



\P n (S) -/3\>e n 



< 



1 



Corollary B.l. Consider a sequence of expectations-identified SERGMs indexed by 
the number of nodes n. 
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(1) Ifva,r^[S] — > oo and Skew^fS 1 ] — > 0, then the sequence is consistent at (3. 

(2) Conversely, if there is some B > such that var^fS*] < B and \S — E^, [5*] | < 

BEp 1 , S — E^,[S] for all (3' within ^ B ^ B of (3 for large enough n, then the 
sequence is not consistent. 

To see an implication of this result, consider a SERGM defined on a count of "large" 
subgraphs, for instance the number of components that contain some given percentage 
of the nodes. The variance of such a statistic is necessarily bounded, and so such a 
SERGM cannot be consistent. 

Proof of Proposition B.2. 

First, note that since (3' ^ (3 implies that Ep[S] ^ Epi[S], and also since Ep[S] is 
continuous in (3, it must be that Ep[S] is monotone in (3. 
Let (3(S) be the MLE of (3 as a function of the statistic S. 

Note that from Section 3.4 we know that /3(s) is the f3 such that Ep[S] = s, which 
varies continuously in (3. 
Let us consider 

[\p(S)-0\>e 

Note that by Lemma B.2, it follows that s'((3) = va,rp(S), and also that s"{(3) = 
E, 



(s-E p [s\y 

Therefore, 



and 



Thus, 



s((3 + e) = s((3) + 
s((3 -e) = s(f3) - 



varp(S) + 
var^(5) + 



kl +y {S)dy 



k$- y (S)dy 



dx 



dx. 



mm[\s{p + e) - s{(3)\,\s{p - e) - s(p)\] >ewax p {S) - 



max 

,/3'e[/3-e,/3+e] 



and by the choice of e < 



var^ [S] 



> ev&Tp{S)/2. 



min [\s(/3 + e) - s((3)\, \s(/3 - e) - s( 
Given the monotonicity of Ep(S) in (3, it then follows that 

\(3(S) -(3\>e implies \S-E P (S)\> evaip(S)/2. 

Thus, 



\P(S)-0\>e 



\S-E n JS)\> evaxp{S)/2 



By Chebychev's inequality it then follows that 



\S-E n JS)\> evex p (S)/2 



< 



Therefore, 



as claimed. 



\/3(S)-(3\>e 



(eVar^(S)/2) 2 e 2 Var^(S)' 
4 

< 



e 2 Var^(S) 
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Now let us examine the converse statement in the proposition. 
Note that var^fS] < B , it follows that E[\S - E%{S)\] < B + 1, and therefore by 
assumption it also follows that \S — Ep(S)\ < (B + 1)5. 58 It then follows that 

var^(S) <\S - H$(S)\E[\S - E^S)\] < (B + 1)BE[\S - E%(S)\], 

and by similar reasoning 

k$[S] <\S- W;(S)\v^(S) <(B + 1) 2 B 2 E[\S - E-(S)\}. 

By similar reasoning to the proof of the first part of the proposition, we can deduce 
that 



P 



\P(S) -P\>e 



> P 



P 



\S-Ep{S)\> eVar^S) +e 2 



max 

.P'e[P-e,P+e] 



kJ,(S)| 



Therefore, 



\P(S) -P\>e 



> P 



P 



S - E n JS)\ > (e(B + 1)B + e\B + 1) 2 5 2 E[\S - E n JS)\] 



For e 



i 



3(B+1)B 



it follows that 



\P{S) -P\>e 



>P/ 



\S -E n JS)\>E[\S -E%S)\]/2 



Given that \S - E n JS)\ < BE[\S - E n g{S)\], it follows that there exists 5 



i 



2B-1 



such 



that 



> s. 



\S-E-(S)\>E[\S-E-(S)\}/2 

establishing the second part of the proposition. 

To see this last claim, let p = P^ \\S - E%(S)\ > E[\S - E%(S)\}/2 



and note that 



(l-p)E \S-E n JS)\ | \S-E n JS)\<E[\S-E"JS)\]/2+pE \S-E n JS)\ \ \\S - E n JS)\ > E[\S - E2(, 



E[\S-E n JS)\] 



and so 



(1 - p)E[\S - E-(S)\}/2 + P BE[\S - E%(S)\] >E[\S- E%(S)\]. 
Therefore, p > 5 = 2B 1 _ 1 , as claimed. ■ 

Proof of Corollary B.l. 

The second claim follows directly from Proposition B.2. 

To see the first claim, let us consider two cases. First consider a case such that 
Var^(S)/\kl(S) n \ is bounded away from 0. In that case apply the first part of Propo- 
sition B.2, with e n = (Var'p(S))~ a for any a < 1/2. Next consider a case such that 

var^(5)/|^(5) n | 0. In that case, set e n 



var'HS') 

— - — In that case, we need to check 



that 



\k$(sr\ 



— > oo, which is equivalent to Skew^(S) — > 0. 



58 To see that E^S 1 — E^S)!] < B + 1, note that for a nonnegative random variable X, 
and 

and the claim follows. 



E[X] < P[X < 1]1 +P[X > l]EpY|X > 1] 

p[x > i]E[x\x > i] < P[x > i]v[x 2 \x > i] < e[a: 2 ], 
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The exponential family gives us a useful relationship between s and its cumulants 
that is useful in what follows. 

Lemma B.2 (Cumulant expansion). s^(/3) is the (q— l)th cumulant of S. 

Proof of Lemma B.2. This follows by inspection. For exponential families Ks(t) = 
log ( ) • Then the fact that the moment generating function can be written as 

Ms(t) = Y,s K $(p) ^' s e t ' s = an d the cumulant generating function is the log of 

the moment generating function implies the result. ■ 



